Enclosed you will find the second revision of the EV6 specification.


This version includes a major rewrite of the external interface, substantial changes to the PAL/IPR 
sections, as well as inclusion of PAL coding restrictions and some electrical and packaging information. 


As the EV6 design proceeds, we are filling in the details of the following topics: 


• Electrical and packaging information
• Test and debug features
• PLL operation
• Error handling


We will send further documentation on these areas (plus errata/changes to rev. 2.0) when available. 


Please note that the EV6 specification is Digital Confidential. Refer all requests for copies to Sue Jacquart 

1. EV6 and the Alpha Architecture


This section describes the ways in which EV6 architecturally differs from prior Alpha implementations. 
These architectural differences fall into four classes: 


1. Extensions to the Alpha Architecture, such as new instructions.
2. Architectural features which the Alpha SRM defines as implementation-specific, such as the size of
   the virtual or physical address space.
3. Instruction set features which the Alpha SRM defines as optional.
4. Arithmetic exceptions.


Alpha SRM version 6.0 and appendix Z of that document are included by reference. 


1.1 Alpha Architectural Extensions 
EV6 includes the following instruction set extensions to the Alpha Architecture: 


Floating point square root for both VAX and IEEE formats 
Population Count - counts the number of ones in an integer register: CTPOP 
Leading and trailing zero count: CTLZ, CTTZ 
Cache Control Operations: 
=> Evict Data Cache Block: ECB 
=> Write Hint: WH64
Integer to floating and floating to integer register transfers: ITOFS, ITOFF, ITOFT, FTOIS & FTOIT
Graphics & MultiMedia instructions:
=> Pixel Error: PERR 
=> Min and Max instructions: MINUB8, MINSB8, MINUW4, MINSW4, MAXUB8, MAXSB8, 
MAXUW4, MAXSW4 
=> Pack and Unpack instructions: PKLB, PKWB, UNPKBL, UNPKBW 
Software-directed prefetch instructions: LDL/LDF/LDG/LDB/LDW/LDS/LDQ/LDT into R31/F31 
Version and architecture extension instructions: AMASK/IMPLVER 
Power-saving feature/instruction: CALL_PAL WTINT 


1.2 Implementation-Specific Features 


8 KB page size

48-bit virtual address, with IPR-controlled 43-bit mode 

44-bit physical address with MSB indicating IO space when set 

Loads into R31 and F31 are executed to completion, and memory access violations, alignment faults 
and fault-on-read errors generated by these instructions are reported by hardware. PALcode is 
expected to dismiss these exceptions as required by Alpha SRM ECO 95. See section 2.5 for more 
details on software prefetching with loads into R31/F31.

Integer operate instructions into R31 are dismissed; no arithmetic exceptions are reported. 

Floating point operate instructions into F31 are dismissed; no arithmetic exceptions are reported. 

Load-locked/Store Conditional semantics are, except for the waiver described below, compliant with
Alpha SRM ECO 102:

• There must be no intervening memory operation between the LDx_L and STx_C; the
  presence of a memory operation (LDx, STx) will cause the STx_C to always fail. One
  exception (for which EV6 requires a waiver): if the memory operation is a WH64, the
  STx_C might succeed even in the presence of a store from another processor to the lock
  range.
• The physical address of STx_C must specify a location within the naturally aligned 16-byte
  block in physical memory accessed by the preceding LDx_L instruction (in processor issue
  sequence) from the same processor. Otherwise it is unpredictable whether the lock flag will
  be cleared by a store from another processor within the lock range.
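
Two of the address rules in this section can be illustrated with a minimal Python sketch. It is not taken from the specification: the function and constant names are hypothetical, and it assumes that the MSB of the 44-bit physical address is bit 43 and that "naturally aligned 16-byte block" means addresses sharing all bits above bit 3.

```python
PA_BITS = 44                 # physical address width stated above
IO_SPACE_BIT = PA_BITS - 1   # assumed: the MSB of the physical address is bit 43
LOCK_BLOCK_BYTES = 16        # size of the naturally aligned lock block


def is_io_space(pa):
    """Return True if the 44-bit physical address has its MSB set."""
    if not 0 <= pa < (1 << PA_BITS):
        raise ValueError("physical address does not fit in 44 bits")
    return bool((pa >> IO_SPACE_BIT) & 1)


def same_lock_block(ldx_l_pa, stx_c_pa):
    """Return True if both addresses fall in the same aligned 16-byte block."""
    return (ldx_l_pa // LOCK_BLOCK_BYTES) == (stx_c_pa // LOCK_BLOCK_BYTES)
```

For example, `same_lock_block(0x1000, 0x100F)` holds, while an STx_C to 0x1010 after an LDx_L from 0x100F falls outside the block and is therefore in the unpredictable case described above.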


1.3 Instruction Set Features Defined as Optional 


This section describes instruction set features which the Alpha SRM defines as optional, and in which
EV6 differs from prior implementations.


• FETCH and FETCH_M are not implemented
• IEEE floating point support
=> NaNs and infinities are generated and propagated in hardware
=> rounding to plus and minus infinity is supported in hardware (this is also true of EV5, but
not of EV4)


1.4 Arithmetic Exceptions 


In EV6 arithmetic exceptions are precise and reported as synchronous traps, and the TRAPB and EXCB 
instructions are processed as NOPs. This behavior is architecturally compliant, but means that the 
software completion rules as currently defined in the Alpha SRM are conservative relative to EV6. These 
rules could simply state that floating operates are not allowed to overwrite their own operands and should 
have their /S qualifier set.

2. Internal Architecture


EV6 is the third-generation implementation of Digital’s Alpha RISC architecture. It is a superscalar CPU 
which performs register renaming, speculative execution and dynamic scheduling in hardware. It contains 
four integer execution units, two of which can perform memory address calculations for load and store 
instructions. It also contains dedicated execution units for floating point add, multiply, divide and square 
root. The on-chip instruction cache is a 64K byte, two-way set associative virtual cache with 64-byte 
blocks. The on-chip data cache is a 64K byte, two-way set associative, virtually indexed, physically 
tagged, write-back cache with 64-byte blocks. 


The external interface consists of two ports - a Bcache port and a System port. The Bcache port is 
controlled entirely by the processor, and is used to interface to a module-level secondary cache which may 
be built from a range of standard synchronous SRAMs. The System port interfaces to the rest of the 
system. The processor contains two external data buses, one 16 bytes wide and the other 8 bytes wide.
The 16-byte bus is used to support the Bcache port and the 8-byte bus is used to support the System port. 


The chip will initially be fabricated in Digital’s 0.35um CMOS-6 process. The speed distribution will 
center at an internal operating frequency of 550 MHz, though the final bin points are TBD. At 500 MHz, 
power dissipation is estimated to be 60 watts at 2.0 volts. 


2.1 Chip Organization 


EV6 consists of the following internal sections: 


Integer execution unit (Ebox) 

Floating point execution unit (Fbox) 

Instruction fetch, issue and retire unit (Ibox) 
Memory reference unit (Mbox) 

External cache and system interface unit (Cbox) 
Instruction cache (Icache) 

Data cache (Dcache) 


2.1.1 Ebox 


The Ebox is a four-wide integer execution unit which is implemented as two functional unit “clusters” - 
labeled 0 and 1. Each cluster contains a copy of an 80-entry physical register file and two “subclusters”, 
named upper (U) and lower (L). Most instructions have one-cycle latency for consumers which execute 
within the same cluster. There is a one cycle delay associated with producing a value in one cluster and
consuming the value in the other cluster. The instruction issue queue minimizes the performance effect of
this cross-cluster delay.

The Ebox contains the following resources:


• Four 64-bit adders, all of which are used to calculate results for integer ADD instructions. The
  adders in subclusters L0 and L1 are used to generate the effective virtual address for load and store
  instructions.
• Four logic units
• Two barrel shifters and associated byte logic - U0 and U1
• Two sets of conditional branch logic - U0 and U1
• Two copies of an 80-entry register file
• One fully pipelined multiplier, with 7-cycle latency for all integer multiply operations - U1
• One fully pipelined unit with 3-cycle latency. This unit executes the following instructions:
=> POPC, LOC, TOC
=> PERR, MINxxx, MAXxxx, UNPKxx, PKxx


The 80 Ebox register file entries contain storage for the values of the 31 Alpha integer registers (the value 
of R31 is not stored), the values of 8 PAL shadow registers, and 41 results written by instructions that 
have not yet retired. Ignoring cross-cluster delay, the two copies of the Ebox register files contain identical 
values. Each copy of the Ebox register file contains four read ports and six write ports. The four read ports 
are used to source operands to each of the two subclusters within a cluster. Two write ports are used to 
write results generated within the cluster; two write ports are used to write results generated by the other 
cluster; and two write ports are used to write results from load instructions. 


2.1.2 Fbox 


The Fbox is a two-wide floating point execution unit which executes both VAX and IEEE floating point 
instructions. It supports IEEE S_floating and T_floating data types and all rounding modes. It also
supports VAX F_floating and G_floating data types, and provides limited support for D_floating format. 
It contains the following resources: 


a 72-entry physical register file 

a fully pipelined multiplier with four cycle latency 

a fully pipelined adder with four cycle latency 

a nonpipelined divide unit associated with the adder pipeline 

a nonpipelined square root unit associated with the adder pipeline 

The 72 Fbox register file entries contain storage for the values of the 31 Alpha floating point registers
other than F31, and 41 values written by instructions that are not yet retired. The Fbox register file
contains six read ports and four write ports. Four read ports are used to source operands to the add and
multiply pipelines, and two read ports are used to source data for store instructions. Two write ports are 
used to write results generated by the add and multiply pipelines, and two write ports are used to write 
results from floating point load instructions. 


2.1.3 Ibox 


The Ibox consists of the following subsections: 
Virtual PC logic 

Instruction-stream translation buffer (ITB) 
Instruction fetch logic 

Register rename maps 

Integer and floating point issue queues 
Exception and interrupt logic 

Retire logic 


2.1.3.1 Virtual PC Logic 


The Virtual PC logic maintains the virtual program counter values for instructions that are in flight. 
There can be up to 80 instructions in 20 successive fetch slots in-flight between the mappers and the end 
of the pipeline, hence the VPC logic contains a 20-deep table to store these fetched VPCs. 


2.1.3.2 Instruction Translation Buffer (ITB) 


The Ibox includes a 128-entry, fully associative translation buffer used to store recently used I-stream 
address translations and page protection information. Each of the entries in the ITB can map 1, 8, 64 or
512 contiguous 8K byte pages. The allocation scheme is round-robin. The ITB supports an 8-bit ASN and
contains an ASM bit. The Icache is virtual, hence the ITB is only accessed for I-stream references which 
miss the Icache. The Icache contains the access-check information so a fetch address translation is only 
made if the address missed in the Icache. 
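
The mapping sizes implied by the 1, 8, 64 or 512 page options can be worked out with a short Python sketch. This is illustrative arithmetic only; the names `mapped_bytes` and `max_reach` are hypothetical, and the "reach" figure assumes every one of the 128 entries uses the largest option.

```python
PAGE_BYTES = 8 * 1024               # 8K byte base page
PAGES_PER_ENTRY = (1, 8, 64, 512)   # contiguous pages a single entry can map
TB_ENTRIES = 128                    # entries in the translation buffer


def mapped_bytes(pages):
    """Bytes covered by one entry that maps `pages` contiguous pages."""
    if pages not in PAGES_PER_ENTRY:
        raise ValueError("an entry maps 1, 8, 64 or 512 pages")
    return pages * PAGE_BYTES


def max_reach(entries=TB_ENTRIES):
    """Upper bound on address space covered if every entry maps 512 pages."""
    return entries * mapped_bytes(512)
```

Under these assumptions a single entry covers 8 KB, 64 KB, 512 KB or 4 MB, and a fully populated buffer covers at most 512 MB.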


2.1.3.3 Instruction Fetch Logic 


The instruction fetcher reads up to four naturally aligned instructions per cycle from the instruction cache. 
It uses both branch prediction and line prediction to maximize efficiency. It also contains a subroutine
return prediction stack and an Icache stream controller. The stream controller generates fetch requests for
additional Icache lines and stores the I-stream data in the Icache. There is no separate buffer to hold
stream requests.


2.1.3.4 Register Rename Maps 


The prefetcher forwards instructions to the integer and floating point register rename maps. The rename 
maps perform two functions. First, they serve to eliminate register WAR and WAW dependencies while 
preserving true RAW data dependencies, in order to allow instructions to be dynamically rescheduled. 
Second, they provide a means of speculatively executing instructions before the control flow previous to
those instructions is resolved. Note that both exceptions and branch mispredicts represent deviations from 
the control flow predicted by the prefetcher. 

The map logic translates each instruction’s operand register specifiers from the “virtual” register numbers 
in the instruction to the “physical” register numbers which hold the corresponding architecturally correct 
values. The map logic also renames each instruction’s destination register specifier from the virtual 
number in the instruction to a physical register number chosen from a list of free physical registers, and
updates the register maps. The map logic can process four instructions per cycle. 


The map logic does not return the physical register which holds the old value of an instruction’s virtual 
destination register to the free list until the instruction retires, which means that the control flow up to 
that instruction has been resolved. If a branch mispredict or exception occurs, the map logic backs the 
contents of both maps up to the state associated with the instruction which triggered the condition, and the 
prefetcher restarts at the appropriate PC. 
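
The renaming, retire and back-up behaviour described above can be sketched as a toy model in Python. This is a simplified illustration, not the EV6 implementation: the class and method names are hypothetical, one map is modelled instead of separate integer and floating point maps, and the back-up is done one instruction at a time rather than in a single cycle.

```python
from collections import deque


class RenameMap:
    """Toy register rename map with a free list and in-order retire."""

    def __init__(self, num_virtual, num_physical):
        # Assumption of this sketch: virtual register v starts in physical register v.
        self.map = list(range(num_virtual))
        self.free = deque(range(num_virtual, num_physical))
        # One record per in-flight instruction: (virtual dest, new phys, old phys).
        self.in_flight = deque()

    def rename(self, dest, sources):
        """Translate the sources, then allocate a fresh physical destination."""
        phys_sources = [self.map[s] for s in sources]
        if not self.free:
            raise RuntimeError("no free physical register available")
        new_phys = self.free.popleft()
        self.in_flight.append((dest, new_phys, self.map[dest]))
        self.map[dest] = new_phys
        return new_phys, phys_sources

    def retire_oldest(self):
        """Retire the oldest instruction; only now is the old register freed."""
        _dest, _new_phys, old_phys = self.in_flight.popleft()
        self.free.append(old_phys)

    def squash_newest(self, count):
        """Undo the newest `count` renames, as after a mispredict or exception."""
        for _ in range(count):
            dest, new_phys, old_phys = self.in_flight.pop()
            self.map[dest] = old_phys
            self.free.appendleft(new_phys)
```

Note how the physical register holding the old value of a destination stays allocated until `retire_oldest` runs, which is what makes `squash_newest` able to restore the earlier mapping.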


At most 20 valid fetch slots containing up to 80 instructions can be in flight between the register maps 
and the end of the machine’s pipeline, where the control flow is finally resolved. The map logic is capable 
of backing the contents of the maps up to the state associated with any of these 80 instructions in a single 
cycle. 


2.1.3.5 Instruction Issue Queues 


The register rename logic places instructions into one of two issue queues, from which they are later issued to
functional units for execution. 


2.1.3.5.1 Integer Queue (IQ) 


The integer queue (IQ) is associated with the Ebox, is 20-deep, and issues instructions of the following 
types at a maximum rate of four operations per cycle: 


integer operates 

integer conditional branches 

unconditional branches - both displacement and memory format 

integer and floating point loads and stores 

PAL-reserved instructions: HW_MTPR, HW_MFPR, HW_LD, HW_ST, HW_RET
ITOFx, FTOIx 


Each queue entry physically produces four request signals - one for each of the Ebox subclusters. A
queue entry asserts a request when it contains an instruction that can be executed by the subcluster, if the 
instruction’s operand register values are available within the subcluster. There are two arbiters - one for 
the upper subclusters and one for the lower subclusters. Each arbiter picks two of the possible 20 
requesters for service each cycle. A given instruction only requests upper subclusters or lower subclusters, 
but since many instructions can only be executed in one type or another this is not too constraining. For 
example, loads and stores can only go to lower subclusters, and shifts can only go to upper subclusters. 
Instructions which can execute in either upper or lower subclusters, such as adds and logic operations, are 
statically assigned before being placed in the IQ. 


The IQ arbiters pick between simultaneous requesters of a subcluster based on age - older instructions are 
given priority over newer instructions. 


If a given instruction requests both lower subclusters and no older instruction requests a lower subcluster, 
then the arbiter assigns subcluster L0 to the instruction. If a given instruction requests both upper
subclusters and no older instruction requests an upper subcluster, then the arbiter assigns subcluster U1 to 
the instruction. This asymmetry between the upper and lower subcluster arbiters is a circuit 
implementation optimization. 
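
The age-based selection performed by the two IQ arbiters can be sketched in Python as follows. This is a behavioural illustration only: the dictionary keys (`id`, `age`, `side`, `ready`) are hypothetical, a larger `age` is assumed to mean an older instruction, and the choice between subclusters 0 and 1 within a side is not modelled.

```python
def pick_oldest(requests, grants_per_cycle=2):
    """Pick up to `grants_per_cycle` requesters, oldest first.

    `requests` is a list of (age, entry_id) pairs; a larger age means the
    instruction is older (an assumption of this sketch).
    """
    ordered = sorted(requests, key=lambda r: r[0], reverse=True)
    return [entry_id for _age, entry_id in ordered[:grants_per_cycle]]


def arbitrate(queue):
    """Run the upper and lower arbiters over one cycle's requests.

    `queue` is a list of dicts with hypothetical keys: 'id', 'age',
    'side' ('U' or 'L', statically assigned before enqueue) and 'ready'
    (True when the operand values are available).
    """
    grants = {}
    for side in ("U", "L"):
        reqs = [(e["age"], e["id"]) for e in queue
                if e["ready"] and e["side"] == side]
        grants[side] = pick_oldest(reqs)
    return grants
```

Each arbiter grants at most two requesters, so the sketch issues at most four instructions per cycle, matching the maximum rate stated for the IQ.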


2.1.3.5.2 Floating Point Queue (FQ) 


The floating point queue is associated with the Fbox, is 15-deep, and issues the following instruction 
types: 


• floating point operates
• floating point conditional branches
• floating point stores
• floating point register to integer register transfers (FTOI)


Each queue entry physically produces three request wires - one for the add pipe, one for the mul pipe, and 
one for stores. There are three arbiters, one for each of the add, mul and store pipes. The add and mul 
arbiters pick one requester per cycle, and each of two store pipe arbiters picks one requester per cycle.


The FQ arbiters pick between simultaneous requesters of a pipe based on age - older instructions are given 
priority over newer instructions. Floating stores and FTOI instructions in even-numbered queue entries
arbitrate for one store port and floating stores and FTOI instructions in odd-numbered queue entries 
arbitrate for a second store port. 


Floating stores and FTOI instructions are enqueued in both the integer and floating queues. They wait in 
the floating queue until their operand register values are available. They subsequently request service to 
the store arbiter. Upon issue from the floating queue, they signal the corresponding entry in the integer
queue to request service. Upon issue from the integer queue, the operation is completed.


2.1.3.6 Exception and Interrupt Logic 

There are two types of exceptions: faults and synchronous traps. Arithmetic exceptions are precise and 
reported as synchronous traps. 

There are four sources of interrupts: 


• Level sensitive hardware interrupts sourced by the irq_h<5:0> pins.
• Edge sensitive hardware interrupts generated by the serial line receive pin, performance counter
  overflows, and hardware corrected read errors.
• Software interrupts sourced by the software interrupt request (SIRR) register.
• Asynchronous system traps (ASTs).


Interrupt sources can be individually masked. In addition, AST interrupts are qualified by the current 
processor mode. 


2.1.3.7 Retire Logic 


The Ibox fetches instructions in program order, executes them out of order, and retires them in order. The 
retire logic maintains the correct architectural state of the machine by retiring an instruction only if all 
previous instructions have executed without generating exceptions or branch mispredicts. In effect, 
retiring an instruction commits the machine to any changes the instruction may have made to
software-visible state, of which there are three classes:


• The integer and floating point registers
• Memory
• Internal processor registers (including control/status registers and translation buffers).


The retire logic can sustain a maximum retire rate of eight instructions per cycle, and can retire as
many as eleven instructions in a single cycle.


2.1.4 On-chip Caches 
EV6 contains two on-chip primary caches implemented with fully static, six transistor CMOS structures. 


2.1.4.1 Instruction Cache 


The instruction cache (Icache) is a 64K byte, virtual cache. Set prediction is used to approximate the 
performance of a two-set cache without slowing the cache access time. Each Icache block contains: 

16 Alpha instructions (64 bytes)

Virtual tag bits <47:15> 

An 8-bit address space number (ASN) field 

A 1-bit address space match (ASM) bit 

A 1-bit PALcode bit to indicate physical addressing 

A valid bit 

Data and tag parity protection 

Four access-check bits: K, E, S, U 

Additional predecoded information to assist with instruction processing and fetch control 


2.1.4.2 Data Cache 


The data cache (Dcache) is a 64K byte, two-way set associative, virtually indexed, physically tagged, 
write-back, read/write allocate cache with 64-byte blocks. Each cycle the Dcache can perform:


two quadword (or shorter) reads to arbitrary addresses, or 

two quadword writes to the same aligned octaword, or 

two non-overlapping less-than-quadword writes to the same aligned quadword, or 
one sequential read and write of the same aligned octaword 


Each Dcache block contains: 


64 data bytes and associated quadword ECC 

Physical tag bits <42:13>

Valid, dirty, shared, and modified bits
A tag parity bit calculated across the tag, dirty, shared and modified bits 
A bit to control round-robin set allocation (one bit per two cache blocks) 


The Dcache contains two sets, each with 512 rows containing 64-byte blocks per row (i.e. 32K bytes of
data per set). EV6 requires 2 additional bits of virtual address beyond the bits which specify an 8K byte
page in order to specify a Dcache row index. Conceptually, a given virtual address might be found in 4
distinct places in the Dcache, depending on the virtual-to-physical translation for those two bits. EV6
prevents this aliasing by keeping only one of the four possible translated addresses in the cache at any 
particular time. 
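
The arithmetic behind the two extra index bits can be checked with a small Python sketch. It only restates the sizes given above; the helper names are hypothetical, and it assumes the row index is taken from the address bits immediately above the 64-byte block offset.

```python
BLOCK_BYTES = 64          # bytes per Dcache block
ROWS_PER_SET = 512        # rows in each of the two sets
PAGE_BYTES = 8 * 1024     # 8K byte page


def log2_exact(n):
    """Return log2(n) for a positive power of two."""
    if n <= 0 or n & (n - 1):
        raise ValueError("expected a positive power of two")
    return n.bit_length() - 1


OFFSET_BITS = log2_exact(BLOCK_BYTES)         # byte-in-block bits
INDEX_BITS = log2_exact(ROWS_PER_SET)         # row index bits
PAGE_OFFSET_BITS = log2_exact(PAGE_BYTES)     # untranslated page offset bits
EXTRA_VA_BITS = OFFSET_BITS + INDEX_BITS - PAGE_OFFSET_BITS


def row_index(va):
    """Row selected by a virtual address (assumed index field placement)."""
    return (va >> OFFSET_BITS) & (ROWS_PER_SET - 1)


def alias_rows(va):
    """Rows that share va's page offset but differ in the extra index bits."""
    base = va & (PAGE_BYTES - 1)
    return [row_index(base | (hi << PAGE_OFFSET_BITS))
            for hi in range(1 << EXTRA_VA_BITS)]
```

With 6 offset bits and 9 index bits against a 13-bit page offset, `EXTRA_VA_BITS` comes out as 2, and `alias_rows` lists the four candidate rows for a given page offset.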


2.1.5 Mbox 


The Mbox is responsible for controlling the Dcache and for ensuring architecturally correct behavior of
load and store instructions. It contains the following structures:


Load queue (LQ) 

Store queue (SQ) 

Miss address file (MAF) 

D-stream translation buffer (DTB) 


2.1.5.1 Load Queue (LQ) 


The load queue (LQ) is essentially a reorder buffer for load instructions. It contains 32 entries and 
maintains the state associated with load instructions which have been issued to the Mbox but which have 
not delivered their results to the CPU and been retired. The Mbox assigns loads to load queue slots based 
on the order in which they were fetched from the Icache and places them into the load queue after they are 
issued by the IQ. The load queue serves to help ensure correct Alpha memory reference behavior. 

2.1.5.2 Store Queue (SQ)


The store queue (SQ) is essentially a reorder buffer and graduation unit for store instructions. It contains 
32 entries and maintains the state associated with store instructions which have been issued to the Mbox 
but which have not both been retired and written to the Dcache. The Mbox assigns stores to store queue 
slots based on the order in which they were fetched from the Icache and places them into the store queue 
after they are issued by the IQ. The store queue holds data associated with stores issued from the IQ until 
they are retired, at which point the store can be allowed to update the Dcache. The store queue also serves 
to help ensure correct Alpha memory reference behavior. 


2.1.5.3 Miss Address File (MAF) 


The miss address file (MAF) holds physical addresses associated with pending Icache and Dcache fill 
requests and pending IO space reads. It contains eight entries. 


2.1.5.4 D-stream Translation Buffer (DTB) 


The Mbox includes a 128-entry, fully associative translation buffer used to store recently used D-stream 
address translations and page protection information. Each of the entries in the DTB can map 1, 8, 64 or 
512 contiguous 8K byte pages. The allocation scheme is round-robin. The DTB supports an 8-bit ASN 
and contains an ASM bit. 


2.1.6 Cbox 

The Cbox controls the Bcache and System ports. It contains the following structures:
• Victim Address File (VAF)
• Victim Data File (VDF)
• IO Write Buffer (IOWB)
• Probe Queue (PQ)
• Duplicate Dcache Tags (DTAGS)


2.1.6.1 Victim Address File (VAF) and Victim Data File (VDF) 


The VAF and VDF together form an 8-entry victim buffer used for holding: 
• Dcache blocks to be written to the Bcache
• I-stream cache blocks from memory to be written to the Bcache
• Bcache blocks to be written to memory
• Cache blocks sent to the system in response to probe commands


2.1.6.2 IO Write Buffer (IOWB) 


The IOWB consists of four 64-byte entries and associated address and control used for buffering IO write 
data between the store queue and the System port. 


2.1.6.3 Probe Queue (PQ) 


The probe queue (PQ) is an eight-deep queue which holds pending System port cache probe commands 
and addresses. 


2.1.6.4 Duplicate Dcache Tag (DTAG) Array 


The DTAG array holds a duplicate copy of the Dcache tags and is used by the Cbox when processing
Dcache fills, Icache fills and System port probes. See section 3 for more details.

2.2 Pipeline Organization
The machine’s basic pipeline is shown below:

[Figure: basic pipeline diagram. Stages: 0 Fetch, 1 Slot, 2 Map, 3 Issue, 4 Register Read, 5 Execute, 6 Dcache Access]





2.2.1 Stage 0 - Instruction Fetch 


In the fetch stage of the pipe, up to four aligned instructions are fetched from the Icache. The branch
prediction tables are also accessed in this cycle. The branch tables produce a prediction for one branch or 
memory format JSR instruction per cycle, hence the prefetcher is limited to fetching through one branch 
per cycle. If there is more than one branch within the fetch line, and the branch predictor predicts that the 
first branch will not be taken, it will predict through subsequent branches at the rate of one per cycle, until 
it predicts a taken branch or predicts through the last branch in the fetch line.


The Icache array also contains a line prediction field, the contents of which are applied to the Icache in 
the next cycle. The purpose of the line predictor is to remove the pipeline bubble which would otherwise 
be created when the branch predictor predicts a branch to be taken. In effect, the line predictor attempts to 
predict the Icache line which the branch predictor will generate. On fills, the line predictor value at each 
fetch line is initialized with the index of the next sequential fetch line, and later retrained by the branch 
predictor if necessary. 


2.2.2 Stage 1 - Instruction Slot 


In the slot stage the branch predictor compares the next Icache index that it generates to the index that 
was generated by the line predictor. If there’s a mismatch the branch predictor wins - the instructions 
fetched during that cycle are aborted, and the index predicted by the branch predictor is applied to the 
Icache the next cycle. Line mispredicts result in one pipeline bubble. 


There is one case where the line predictor takes precedence over the branch predictor - memory format 
calls or jumps. If the line predictor was trained with a true (as opposed to predicted) memory format call 
or jump target, then its contents take precedence over the target hint field associated with these
instructions. This allows dynamic calls or jumps to be correctly predicted. 


The instruction fetcher produces the full VPC during the fetch stage of the pipe. The Icache produces the 
tags for both sets 0 and 1 each time it’s accessed, which enables the fetcher to differentiate set mispredicts 
from true Icache misses. If the access was a set mispredict the fetcher aborts the last two fetched slots and 
re-fetches the slot in the next cycle. It also retrains the appropriate set prediction bits. 


The instruction data is transferred from the Icache to the integer and floating point register map hardware
during this stage. In addition the integer instructions begin to pass through the slot logic, which
determines whether they will use upper or lower Ebox subclusters.


2.2.3 Stage 2 - Map 


Instructions are sent from the Icache to the integer and floating point register maps during the slot stage,
and register renaming is performed during the map stage. Also, each instruction is assigned a unique 8-bit
number, called an inum, which is used to identify the instruction and its program order with respect to 
other instructions during the time that it is in flight. Instructions are considered to be in flight between the 
time they are mapped and the time they are retired. 


Mapped instructions and their associated inums are placed in the integer and floating point queues by the 
end of the map stage. 


2.2.4 Stage 3 - Issue 


Instructions are selected for execution by the IQ and FQ during the issue stage of the pipe. In general,
instructions are deleted from the IQ or FQ two cycles after they issue - i.e. if an instruction issues in cycle
N, it remains in the queue but does not request service in cycle N+1, and is gone in cycle N+2.


2.2.5 Stage 4 - Register Read 


Instructions which are issued from the queues read their operands from the register files and receive 
bypass data. 





2.2.6 Stage 5 - Execute 
The Ebox and Fbox pipelines begin execution in this pipe stage. 


2.2.7 Stage 6 - Dcache Access 


Memory reference instructions access the Dcache and data translation buffers in this pipe stage. In 
general, loads access both the tag and data arrays in pipe stage 6, while stores only access the tag array. 
Store data is written into the store queue where it is held until the store instruction retires. 


Most integer operate instructions write their register results in this cycle. 


2.2.8 Instruction Retire 


A given instruction retires when it has been executed to completion, and all previous instructions have 
been retired. The execution pipe stage in which a given instruction becomes eligible to be retired depends 
upon the type of instruction. The following table gives the minimum retire latencies (assuming that all 
previous instructions have been retired) for various classes of instructions: 


Instruction Class        Retire Stage   Comments

INT Conditional Branch   7
INT Multiply             7/13           13 for MUL/V
INT Operate              7
Memory                   10
FP Add                   11
FP Mul                   11
FP DIV/SQRT              11+L           Add latency of instruction, see section 2.7.3. Latency is 11 if
                                        hardware detects that no exception is possible (see section
                                        2.2.8.1)
FP Conditional Branch    11             Branch mispredict is reported in stage 7
BSR/JSR                  10             JSR mispredict is reported in stage 8


2.2.8.1 FP Divide/Square Root Early Retire 


The FP divider and square root unit can detect that, for many combinations of source operand values, no 
exception can be generated. Instructions with these operands can retire before the result is generated. 
When detected, they retire with the same latency as FP Add. Early retire is not possible for the following 
instruction/operand/architecture state conditions: 

Instruction is not a DIV or SQRT 

SQRT source operand is negative 

Divide operand exponent_a is 0 

Either operand is NaN or INF 

Divide operand exponent_b is 0 

Trapping mode is /I (inexact) 

INE status bit is 0

Early retire is also not possible for divides if the resulting exponent has any of the following 
characteristics (define EXP as the result exponent): 

• DIVT, DIVG: EXP >= 0x3ff or EXP <= 0x2
• DIVS, DIVF: EXP >= 0x7f or EXP <= 0x382


2.2.9 Retire of Operates into R31/F31 


Many instructions which have R31 or F31 as their destination are retired immediately upon decode (stage 
3). These instructions do not produce a result and are ‘squashed’ from the pipeline as well -- they do not 
occupy a slot in the issue queues and do not occupy a functional unit. 


Instruction Type           
INTA, INTL, INTM, INTS     All with R31 as destination
FLTI, FLTL, FLTV           All with F31 as destination. MT_FPCR is not included because it has no
                           destination -- it is never squashed
LDQ_U                      All with R31 as destination
MISC                       TRAPB and EXCB are always squashed. Others are never squashed.
FLTS                       All (SQRT, ITOF) with F31 as destination


2.2.10 Pipeline Aborts 


The following table lists the timing associated with each common source of pipeline abort. The abort 
penalty as given is measured from the cycle after the fetch stage of the instruction which triggers the abort 
to the fetch stage of the new target, ignoring any Ibox pipeline stalls or queuing delay which the triggering 
instruction might experience. 





Abort Condition            Penalty (cycles)   Comments

Branch mispredict          7                  integer or floating conditional branch mispredict
JSR mispredict             8                  memory format JSR or HW_RET
Mbox order trap            14                 load-load order, store-load order
Other Mbox replay traps    13
DTB miss                   13
ITB miss                   7
Integer arithmetic trap    12
FP arithmetic trap         13+L               Add latency of instruction. See section 2.7.3 for
                                              instruction latencies.


2.3 Memory And I/O Accesses 


This section provides a brief overview of EV6 processing of memory and IO references. 


The IQ may issue any combination of loads and stores to the Mbox at the rate of two per cycle. The two 
lower Ebox subclusters, L0 and L1, generate the 48-bit effective virtual address for these instructions. 


In the discussions which follow, an instruction is said to be newer than another instruction if it follows 
that instruction in program order and is said to be older if it precedes that instruction in program order. 


2.3.1 Memory Space Load Instructions 


The Mbox begins execution of a load instruction by translating its virtual address to a physical address 
using the DTB and by accessing the Dcache. The Dcache is virtually indexed, allowing these two 
operations to be done in parallel. The Mbox puts information about the load, including its physical 
address, destination register and data format, into the load queue. 


If the requested physical location is found in the Dcache (a hit) the data is formatted and written into the 
appropriate integer or floating register. If the location is not in the Dcache (a miss) then the physical 
address is placed in the miss address file (MAF) for processing by the Cbox. The MAF performs a 
merging function in which a new miss address is compared to miss addresses already held in the MAF. If 
the new miss address is to the same Dcache block as a miss address already held in the MAF, then the 
new miss address is discarded. 
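
The merging comparison described above can be sketched as follows. This is only an illustration, assuming aligned 64-byte Dcache blocks; the class and method names are hypothetical, and the real MAF holds considerably more state than a list of block numbers.

```python
BLOCK_BYTES = 64  # assumed size of an aligned Dcache block


def block_address(pa):
    """Return the number of the aligned 64-byte block containing pa."""
    return pa // BLOCK_BYTES


class MissAddressFile:
    """Toy model of MAF merging for memory-space load misses (hypothetical)."""

    def __init__(self):
        self.blocks = []  # outstanding miss block numbers, oldest first

    def record_miss(self, pa):
        """Return True if a new entry was allocated, False if the miss
        merged with an entry that is already outstanding."""
        blk = block_address(pa)
        if blk in self.blocks:
            return False  # same Dcache block: the new miss address is discarded
        self.blocks.append(blk)
        return True
```

Two misses to 0x1000 and 0x103F fall in the same block and occupy one entry, while a miss to 0x1040 allocates a second entry.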


When Dcache fill data is returned to the Dcache by the Cbox, the Mbox satisfies the requesting loads in 
the load queue. 


2.3.2 IO Space Load Instructions 


Since IO space reads may have side effects, they can’t be done speculatively. Hence, when the Mbox 
receives an IO space read it first places it in the load queue, where it is held until it retires. The Mbox 
replays retired IO space reads from the load queue to the MAF in program order at a rate of one per CPU 
cycle. 


The MAF handles IO reads differently from memory reads, since for IO space reads the system requires 
an indication as to which bytes were actually accessed by the CPU. Each MAF entry contains 8 mask bits 
and a 2-bit length field to hold this information, and may thus hold: 


• a single byte or word IO read (byte and word length IO reads are not merged), or 
• up to eight longword IO reads within an aligned 32-byte region, or 
• up to eight quadword IO reads within an aligned 64-byte region, or 
• a single memory space read for an aligned 64-byte Dcache block 


EV6 maintains IO reference ordering as follows (assume address X and address Y are different): 


First Reference        Second Reference       Order
LD-IO to address X     LD-IO to address Y     maintained
ST-IO to address X     LD-IO to address X     maintained
ST-IO to address X     LD-IO to address Y     not maintained


When the Mbox allocates a new MAF entry to an IO read, it attempts to merge other IO reads into the 
same entry until one of the following conditions occurs, at which point the entry may be serviced by the 
Cbox: 


• an IO read which doesn’t merge with the entry is replayed from the load queue 
• four cycles go by without an IO read merging with the entry 
• an IO read which matches the entry but touches a mask bit which is already set is replayed from the 
  load queue 
• an IO write matches the entry 


The Cbox sends IO read requests off-chip in the order in which they were received from the Mbox. 


2.3.3 Memory Space Store Instructions 


The Mbox begins execution of a store instruction by translating its virtual address to a physical address 
using the DTB and by probing the Dcache. The Mbox puts information about the store, including its 
physical address, its data and the results of the Dcache probe, into the store queue. 


If the Mbox does not find the addressed location in the Dcache then it places the address into the MAF for 
processing by the Cbox. If the Mbox finds the addressed location in a Dcache block which isn’t dirty, then 
it places a ChangeToDirty request into the MAF. 


A given store instruction may write the Dcache only after it has retired and while the Dcache block 
containing its address is dirty in the Dcache. Store queue entries which meet these two conditions may be 
placed into the writable state; this is done in program order at a maximum rate of two entries per cycle. The 
Mbox transfers writable store queue entries from the store queue to the Dcache in program order at a 
maximum rate of two stores per cycle. Dcache lines associated with writable store queue entries are locked 
down by the Mbox: System port probe commands cannot evict these blocks until their associated writable 
store queue entries have been transferred into the Dcache. This restriction assists in store-conditional and 
Dcache ECC processing. 


Stores in the store queue which have not been transferred to the Dcache may source data to 
newer load instructions. The Mbox compares the virtual Dcache index bits of incoming loads to queued 
stores, and sources the data from the store queue, bypassing the Dcache, when necessary. 


2.3.4 IO Space Store Instructions 


The Mbox begins processing IO space stores just like memory space stores - by translating the virtual 
address and placing state associated with the store into the store queue. 


The Mbox replays retired IO space stores from the store queue to the IOWB in program order at a rate of 
one per CPU cycle. Each IOWB entry may contain: 


• a single byte or word IO write (byte and word length IO writes are not merged), or 
• up to eight longword IO writes within an aligned 32-byte region, or 
• up to eight quadword IO writes within an aligned 64-byte region 


When the Mbox allocates a new IOWB entry to an IO write, it attempts to merge other IO writes into the 
same entry until one of the following conditions occurs, at which point the entry may be serviced by the 
Cbox: 


• an IO write which doesn’t merge with the entry is replayed from the store queue 
• four cycles go by without an IO write merging with the entry 
• an IO write which matches the entry but touches a mask bit which is already set is replayed from the 
  store queue 
• an IO read matches this entry 
• a WMB instruction is replayed from the store queue 


The Mbox never allows queued IO space stores to source data to subsequent loads. The Cbox sends IO 
space write requests off-chip in the order they were received from the Mbox. 
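
The size, region and mask-bit checks that decide whether an IO write merges into an IOWB entry can be sketched as below. This is a simplified model under stated assumptions: addresses are naturally aligned, one mask bit stands for one longword or quadword slot, and the four-cycle timer, IO read matches and WMB are not modelled. The class name and fields are hypothetical.

```python
# Size in bytes -> size of the aligned region that one entry can cover.
REGION_BYTES = {4: 32, 8: 64}  # longword: 32-byte region, quadword: 64-byte region


class IOWBEntry:
    """Toy model of one IO write buffer entry (hypothetical)."""

    def __init__(self, addr, size):
        self.size = size
        if size in REGION_BYTES:
            self.base = addr - addr % REGION_BYTES[size]
            self.mask = 1 << ((addr - self.base) // size)
        else:
            self.base = addr  # byte and word writes are never merged
            self.mask = 0

    def try_merge(self, addr, size):
        """Return True if the write merges; False means the entry is closed."""
        if size != self.size or size not in REGION_BYTES:
            return False  # differing sizes, or byte/word, do not merge
        if addr - addr % REGION_BYTES[size] != self.base:
            return False  # outside this entry's aligned region
        bit = 1 << ((addr - self.base) // size)
        if self.mask & bit:
            return False  # touches a mask bit which is already set
        self.mask |= bit
        return True
```

For example, a longword write to 0x100 followed by one to 0x104 merges into a single entry, while a second write to 0x104 does not.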


2.4 Replay Traps 


There are some situations in which a load or store instruction can not be executed due to a condition 
which is detected after that instruction issues from the IQ or FQ. The instruction is therefore aborted 
(along with all newer instructions) and restarted from the fetch stage of the pipeline. This mechanism is 
called a replay trap. 


2.4.1 Mbox Order Traps 


Load and store instructions may be issued from the IQ in a different order than they were fetched from the 
Icache, while architecturally, D-stream memory accesses to the same physical bytes must be completed in 
order. Generally, the Mbox manages the memory reference stream by itself to achieve architecturally 
correct behavior, but there are two cases in which replay traps are used to manage the memory stream. 


The Mbox ensures that loads which reference the same physical byte(s) ultimately issue in order via the 
load-load order trap. The Mbox compares the address of each newly issued load to that of all loads in the 
load queue. If it finds a newer load instruction in the load queue then it invokes a load-load order trap on 
the newer instruction. This is a replay trap which aborts the target of the trap and all newer instructions 
from the machine and refetches instructions starting at the target of the trap. 


The Mbox ensures that a load ultimately issues after an older store which writes some portion of its 
memory operand via the store-load order trap. The Mbox compares the address of each newly issued 
store to that of all loads in the load queue. If it finds a newer load instruction in the load queue then it 
invokes a store-load order trap on the load instruction. This is a replay trap, just like the load-load order 
trap. The Ibox contains extra hardware to reduce the frequency of this trap. There is a one-bit by 
1024-entry PC-indexed table in the Ibox called the stWait table. At Icache fetch time this table is accessed 
along with the Icache. The table produces one bit for each instruction accessed from the Icache. When a 
load instruction gets a store-load order replay trap its associated bit in the stWait table is set during the 
cycle that the load is re-fetched. Hence the trapping load’s stWait bit will be set the next time it’s fetched. 
The IQ will not issue load instructions whose stWait bit is set while there are older unissued stores in the 
queue. A load instruction whose stWait bit is set can issue the cycle immediately after the last older store 
issues from the queue. All the bits in the stWait table are unconditionally cleared every 16384 cycles. 
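
The behaviour of the stWait table can be sketched as a small model. The table size and the clearing interval come from the text above; the indexing function (low-order instruction address bits) and all names are assumptions made for illustration only.

```python
ST_WAIT_ENTRIES = 1024   # one bit per entry
CLEAR_INTERVAL = 16384   # cycles between unconditional clears


class StWaitTable:
    """Toy model of the PC-indexed stWait table (hypothetical)."""

    def __init__(self):
        self.bits = [False] * ST_WAIT_ENTRIES
        self.cycle = 0

    @staticmethod
    def index(pc):
        # Assumption: instructions are 4 bytes, so drop PC<1:0> before indexing.
        return (pc >> 2) % ST_WAIT_ENTRIES

    def tick(self):
        """Advance one CPU cycle; clear every bit each CLEAR_INTERVAL cycles."""
        self.cycle += 1
        if self.cycle % CLEAR_INTERVAL == 0:
            self.bits = [False] * ST_WAIT_ENTRIES

    def record_store_load_trap(self, load_pc):
        """Set the bit of a load that took a store-load order replay trap."""
        self.bits[self.index(load_pc)] = True

    def load_may_issue(self, load_pc, older_unissued_stores):
        """A load whose bit is set waits while older stores are unissued."""
        return not (self.bits[self.index(load_pc)] and older_unissued_stores > 0)
```

A load at a given PC issues freely until it traps once; after that it waits for older stores until the next periodic clear.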
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2.4.2 Other Mbox Replay Traps 


The Mbox also uses replay traps to flow control the load queue and store queue, and to ensure that there 
are never multiple outstanding misses to different physical addresses which map to the same Dcache or 
Bcache line. Unlike the order traps, however, these replay traps are invoked on the incoming instruction 
which triggered the condition. 


2.5 Software-Directed Prefetching and Loads into R31 & F31 
This section describes how EV6 processes the various forms of load into R31/F31. 


First, EV6 requires PALcode assistance to conform to ECO 95 - loads into R31/F31 may generate 
exceptions - these exceptions must be dismissed by PALcode. 


2.5.1 Normal Prefetch: LDL, LDF, LDG, LDB, LDW 


EV6 processes these instructions as "normal" cache line prefetches - if the load hits the Dcache, the 
instruction is dismissed, otherwise the addressed cache block is allocated into the Dcache. 


2.5.2 Prefetch with Modify Intent: LDS 


EV6 processes an LDS into F31 as a prefetch with modify intent. If the load hits a dirty, modified Dcache 
block the instruction is dismissed. Otherwise, the addressed cache block is allocated into the Dcache for 
write access - its dirty and modified bits are set. 


2.5.3 Prefetch, Evict Next: LDQ 


EV6 processes this like a "normal" prefetch, with one exception. If the load misses the Dcache, the 
addressed cache block is allocated into the Dcache, but the Dcache set allocation pointer is left pointing to 
this block. The next miss to the same Dcache line will evict the block. One example where this 
instruction might be used is when software is reading an array which is known to fit in the off-chip 
secondary cache, but will not fit in the on-chip Dcache. The use of the instruction in this case will ensure 
that hardware provides the desired prefetch function without displacing useful cache blocks stored in the 
other set of the Dcache. 


2.5.4 Prefetch, No Reuse: LDT 


This instruction will indicate to EV6 that the addressed cache block will be accessed once and not 
accessed again for a long time. This instruction might be used when sweeping through the contents of an 
array which is known to be larger than the secondary cache, for example, and will inform EV6 to perform 
a cache line prefetch without displacing otherwise useful cache blocks. 


EV6 will respond to this instruction as follows. If the load hits the Dcache, the instruction is dismissed. 
Otherwise the addressed cache block is fetched from the Bcache or memory, depending upon the result of 
the Bcache tag probe, and transmitted across EV6's internal data busses. This external reference will not 
result in a fill of either the Dcache or the Bcache, however. Any loads to the same cache block and which 
issue after the prefetch issues but before the block is transmitted across the internal busses will be satisfied 
when the prefetched block is transmitted across the internal data busses. Loads to this cache block which 
issue after the block is transmitted will miss the Dcache and result in another external read, either to 
memory or the Bcache. 


2.6 Special Cases 


This section describes the mechanisms by which EV6 processes “irregular” instructions in the Alpha 
instruction set, or cases in which EV6 processes instructions in a non-intuitive way. 
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2.6.1 Load Hit Speculation 
The latency of integer loads which hit in the Dcache is three cycles. Here is the pipeline timing: 


Hit 
cycle #   1   2   3   4   5   6   7   8
ILD       Q   R   E   D   B
instr1                Q   R
instr2                    Q


There are two cycles in which the IQ may speculatively issue instructions which consume load data before 
Dcache hit information is known. Any instructions which issue from the IQ within this two cycle 
“speculative window” are kept in the IQ with their requests inhibited until the load’s hit condition is 
known, even if they are not dependent on the load. If the load hits then these instructions are removed 
from the queue. If the load misses then the execution of these instructions is aborted and the instructions 
are allowed to request service again. For example, in the above diagram, instr1 and instr2 are issued 
within the speculative window of the load. If the load hits then both instructions will be deleted from the 
queue by the start of cycle 7 - one cycle later than normal for instr1 and at the normal time for instr2. If 
the load misses then both instructions are aborted from the execution pipelines and may request service 
again in cycle 6. 


IQ-issued instructions are aborted if issued within the speculative window of an integer load which missed 
the Dcache, even if they are not dependent on the load. However, if software knows misses are likely, it 
can still benefit from scheduling the instruction stream for Dcache miss latency. EV6 includes a saturating 
counter which is incremented by load misses and decremented by load hits. When the upper bit of the 
counter is set the integer load latency is increased to five cycles, and the speculative window is removed. 
The counter is 5 bits wide, and increments by two on a miss and decrements by one on a hit. 
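
The counter can be modelled in a few lines. This sketch assumes the counter moves up by two on a miss and down by one on a hit, in line with the "incremented by load misses and decremented by load hits" description, and that it starts at zero; the start value and all names are assumptions.

```python
class LoadMissCounter:
    """Toy model of the 5-bit saturating load hit/miss counter (hypothetical)."""

    BITS = 5
    MAX = (1 << BITS) - 1  # 31

    def __init__(self):
        self.value = 0  # assumed reset value

    def record(self, hit):
        """Update the counter for one integer load, saturating at 0 and MAX."""
        if hit:
            self.value = max(0, self.value - 1)
        else:
            self.value = min(self.MAX, self.value + 2)

    def upper_bit_set(self):
        return bool(self.value >> (self.BITS - 1))

    def integer_load_latency(self):
        """Five cycles (no speculative window) when the upper bit is set."""
        return 5 if self.upper_bit_set() else 3
```

With these assumptions, eight consecutive misses from reset raise the counter to 16 and switch the integer load latency to five cycles; a single hit then drops it back to 15 and the three-cycle latency.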


Since loads into R31 do not produce a result, they do not create a “speculative window” when they execute 
and therefore never waste IQ-issue cycles if they miss. 


Floating loads which hit in the Dcache have a latency of four cycles. Here is the pipeline timing: 


Hit 
cycle #   1   2   3   4   5   6   7   8
FLD       Q   R   E   D   B
instr1                    Q   R
instr2                        Q


For floating loads the speculative window is only one cycle wide, and FQ-issued instructions which issue 
within the speculative window of a missing floating load are only aborted if they depend on the load. For 
example, in the above diagram instr1 is issued in the speculative window of the load. If it is not a 
consumer of the data returned by the load then it is removed from the queue at its normal time - just at the 
start of cycle 7. If it is dependent on the load data and the load hit it is removed from the queue one cycle 
later - at the start of cycle 8, while if the load missed then it is aborted from the Fbox pipeline and may 
request service again in cycle 7. 
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2.6.2 Floating Point Stores 


Floating point store instructions are cloned and loaded into both the IQ and the FQ from the mapper. 
Each IQ entry contains a control bit called fpWait, which when set prevents that entry from asserting its 
requests. This bit is initially set for each floating store which enters the IQ unless it was the target of a 
replay trap. The instruction’s FQ clone issues when its Ra register is about to become clean, resulting in 
its IQ clone’s fpWait bit being cleared and allowing the IQ clone to issue and be executed by the Mbox. 
This mechanism ensures that floating stores are always issued to the Mbox along with their associated 
data without requiring the floating register dirty bits to be available within the IQ. 


2.6.3 CMOV 


For EV6, the Alpha CMOV instruction has three operands, and thus presents a special case. The required 
operation is to move either the value in register Rb or the value from the old physical destination register 
into the new destination register, based on the value in Ra. Since neither the mapper nor the Ebox and 
Fbox data paths are otherwise required to handle three operand instructions, the CMOV instruction is 
decomposed by the Ibox pipeline into two, two-operand instructions. 


cmov  Ra,Rb -> Rc 


cmov1 Ra,oldRc -> newRc1 
cmov2 newRc1,Rb -> newRc2 


The first instruction, cmov1, tests the value of Ra and records the result of this test in a 65th bit of its 
destination register, newRc1. It also copies the value of the old physical destination register, oldRc, to 
newRc1. The second instruction, cmov2, then copies either the value in newRc1 or the value in Rb into a 
second physical destination register, newRc2, based on the CMOV “predicate” bit stored in newRc1. In 
summary, the original CMOV instruction is decomposed into two dependent instructions which each 
consume a physical register from the free list. 


In order to further simplify this operation the two component instructions of a CMOV instruction are 
driven through the mappers in successive cycles. Hence, if a given fetch line contains N CMOV 
instructions, it takes N+1 cycles to run that fetch line through the mappers. For example, the following 
fetch line: 


add cmovx sub cmovy 
results in the following three map cycles: 


add cmovx1 
cmovx2 sub cmovy1 
cmovy2 
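
The split of a fetch line into map cycles can be reproduced with a short sketch. It assumes that any mnemonic beginning with "cmov" stands for a CMOV and that the two halves are named by appending 1 and 2; both conventions are illustrative only.

```python
def is_cmov(op):
    """Assumed naming convention: CMOV mnemonics start with 'cmov'."""
    return op.startswith("cmov")


def map_cycles(fetch_line):
    """Group the instructions of one fetch line into successive map cycles.

    The first half of a CMOV ends the current map cycle and its second half
    starts the next one, so N CMOVs yield N + 1 map cycles.
    """
    cycles = [[]]
    for op in fetch_line:
        if is_cmov(op):
            cycles[-1].append(op + "1")
            cycles.append([op + "2"])
        else:
            cycles[-1].append(op)
    return cycles


print(map_cycles(["add", "cmovx", "sub", "cmovy"]))
# [['add', 'cmovx1'], ['cmovx2', 'sub', 'cmovy1'], ['cmovy2']]
```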


Integer CMOVs are executed as two distinct one-cycle latency operations by the Ebox. 
Floating CMOVs are executed as two distinct four-cycle latency operations by the Fbox add pipeline. 
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2.7 Instruction Issue Rules 


This section defines instruction classes, the functional unit pipelines to which they are issued, and their 
associated latencies. 


2.7.1 Instruction Class Definitions 


The table below defines instruction classes as they apply to the issue rules, and for each class specifies 
which of the functional unit pipelines execute those instructions. 





Class Name   Pipeline                   Instruction List
ild          L0, L1                     all integer loads
fld          L0, L1                     all floating loads
ist          L0, L1                     all integer stores
fst          FST0, FST1, L0, L1         all floating stores
lda          L0, L1, U0, U1             LDA, LDAH
mem_misc     L1                         WH64, ECB, WMB
rpcc         L1                         RPCC
rx           L1                         RS, RC
mxpr         L0, L1 (depends on IPR)    HW_MTPR, HW_MFPR
ibr          U0, U1                     integer conditional branches
jsr          L0                         BR, BSR, JMP, CALL, RET, COR, HW_RET, CALL_PAL
iadd         L0, U0, L1, U1             opcode 0x10, except CMPBGE
ilog         L0, U0, L1, U1             AND, BIC, BIS, ORNOT, XOR, EQV, CMPBGE
ishf         U0, U1                     opcode 0x12
cmov         L0, U0, L1, U1             integer CMOV - either clone
imul         U1                         integer multiplies
imisc        U0                         LOC, TOC, POPC, PERR, MINxxx, MAXxxx, PKxx, UNPKxx
fbr          FA                         floating conditional branches
fadd         FA                         all floating operates except multiply, divide, square root and
                                        conditional move
fmul         FM                         floating multiply
fcmov1       FA                         floating CMOV - first half
fcmov2       FA                         floating CMOV - second half
fdiv         FA                         floating divide
fsqrt        FA                         floating square root
nop          none                       TRAPB, EXCB, UNOP - LDQ_U R31, 0(Rx)
ftoi         L0, L1                     FTOIS, FTOIT
itof         L0, L1                     ITOFS, ITOFF, ITOFT
mx_fpcr      FM                         move from floating point control register


2.7.2 Ebox Slotting 


Instructions which issue from the IQ and could execute in either upper or lower Ebox subclusters are 
slotted to one pair or the other during the map stage of the pipeline, based on the instruction mix in the 
fetch line. These slotting rules are defined in the table below. In the type column, “U” means the 
instruction only executes in an upper subcluster, “L” means the instruction only executes in a lower 
subcluster, and “E” means the instruction could execute in either an upper or lower subcluster. The 
numbers 3,2,1 and 0 identify each instruction’s location in the fetch line by the value of bits <3:2> of its 
PC. 


Instruction Type   Slotting     Instruction Type   Slotting
3210               3210         3210               3210
EEEE               ULUL         LLLL               LLLL
EEEL               ULUL         LLLU               LLLU
EEEU               ULLU         LLUE               LLUU
EELE               ULLU         LLUL               LLUL
EELL               UULL         LLUU               LLUU
EELU               ULLU         LUEE               LULU
EEUE               ULUL         LUEL               LUUL
EEUL               ULUL         LUEU               LULU
EEUU               LLUU         LULE               LULU
ELEE               ULUL         LULL               LULL
ELEL               ULUL         LULU               LULU
ELEU               ULLU         LUUE               LUUL
ELLE               ULLU         LUUL               LUUL
ELLL               ULLL         LUUU               LUUU
ELLU               ULLU         UEEE               ULUL
ELUE               ULUL         UEEL               ULUL
ELUL               ULUL         UEEU               ULLU
ELUU               LLUU         UELE               ULLU
EUEE               LULU         UELL               UULL
EUEL               LUUL         UELU               ULLU
EUEU               LULU         UEUE               ULUL
EULE               LULU         UEUL               ULUL
EULL               UULL         UEUU               ULUU
EULU               LULU         ULEE               ULUL
EUUE               LUUL         ULEL               ULUL
EUUL               LUUL         ULEU               ULLU
EUUU               LUUU         ULLE               ULLU
LEEE               LULU         ULLL               ULLL
LEEL               LUUL         ULLU               ULLU
LEEU               LULU         ULUE               ULUL
LELE               LULU         ULUL               ULUL
LELL               LULL         ULUU               ULUU
LELU               LULU         UUEE               UULL


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2.7.3 Instruction Latencies 


After an instruction is placed in the IQ or FQ, its issue point is determined by the availability of its 
register operands, functional unit(s), and relationship to other instructions in the queue. There are register 
producer-consumer dependencies and dynamic functional unit availability dependencies which affect 
instruction issue. The mapper removes register producer-producer dependencies. 


The latency to produce a register result is generally fixed. The exception is for loads which miss the 
Dcache. 





Class     Latency   Comments
ild       3         Dcache hit
          13+       Dcache miss, latency with 6-cycle Bcache. Add additional Bcache loop
                    latency if Bcache is slower than 6 cycles.
fld       4         Dcache hit
          8+        Dcache miss, latency with 0-cycle Bcache. Add Bcache loop latency.
ist                 doesn’t produce register value
fst                 doesn’t produce register value
rpcc      1         possible 1-cycle cross-cluster delay
rx        1
mxpr      1 or 3    HW_MFPR. Ebox IPRs: 1. Ibox & Mbox IPRs: 3. HW_MTPR doesn’t
                    produce a register value.
icbr                conditional branch; doesn’t produce register value
ubr       3         unconditional branch
jsr       3
iadd      1         possible 1-cycle Ebox cross-cluster delay
ilog      1         possible 1-cycle Ebox cross-cluster delay
ishf      1         possible 1-cycle Ebox cross-cluster delay
cmov1     1         only consumer is cmov2. possible 1-cycle Ebox cross-cluster delay
cmov2     1         possible 1-cycle Ebox cross-cluster delay
imul      7         possible 1-cycle Ebox cross-cluster delay
imisc     3         possible 1-cycle Ebox cross-cluster delay
fcbr                doesn’t produce register value
fadd      4         consumer other than fst or ftoi
          6         consumer fst or ftoi. measured from fadd issuing from FQ to fst or ftoi
                    issuing from IQ
fmul      4         consumer other than fst or ftoi
          6         consumer fst or ftoi. measured from fmul issuing from FQ to fst or ftoi
                    issuing from IQ
fcmov1    4         only consumer is fcmov2
fcmov2    4         consumer other than fst
          6         consumer fst or ftoi. measured from fcmov2 issuing from FQ to fst or ftoi
                    issuing from IQ
fdiv      12        single precision - latency to consumer of result value
          10        single precision - latency to using divider again
          15        double precision - latency to consumer of result value
          13        double precision - latency to using divider again
fsqrt     16        single precision - latency to consumer of result value
          14        single precision - latency to using unit again
          32        double precision - latency to consumer of result value
          30        double precision - latency to using unit again
ftoi      3
nop                 doesn’t produce register value


3. External Interface 


The external interface consists of two ports - a Bcache port and a System port. The Bcache port is 
controlled entirely by the processor, and is used to interface to a module-level secondary cache which may 
be built from a range of standard synchronous SRAMs. The System port interfaces to the rest of the 
System. The processor contains two external data busses, one 16-bytes wide for the Bcache and the other 
8-bytes wide for the System. 







[Figure: EV6 external interface. Recoverable labels: EV6; Duplicate Tag (optional); Data/Control; 
System port signals SysAddIn_L<14:0>, SysAddInClk_L, SysFillValid_L, SysDataInValid_L, 
SysDataOutValid_L, SysAddOut_L<14:0>, SysAddOutClk_L.] 


3.1 Address Spaces 


EV6 supports a 44-bit physical address space which is divided equally between Memory space and IO 
space. Memory space resides in the lower half of the physical address space (PA<43> clear) and IO space 
resides in the upper half of physical address space (PA<43> set). EV6 recognizes these spaces internally. 
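
The split of the physical address space can be expressed as a one-bit test. The sketch below is illustrative only; the function name is hypothetical.

```python
PA_BITS = 44               # width of the EV6 physical address
IO_SPACE_BIT = 1 << 43     # PA<43>: clear = Memory space, set = IO space


def is_io_space(pa):
    """Return True if physical address pa lies in IO space (PA<43> set)."""
    if not 0 <= pa < (1 << PA_BITS):
        raise ValueError("physical address does not fit in 44 bits")
    return bool(pa & IO_SPACE_BIT)
```

Under this test the highest Memory space address is 2^43 - 1 and the lowest IO space address is 2^43.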


EV6-generated external references to Memory space are always of a fixed 64-byte size, though the internal 
access granularity is byte, word, longword or quadword. All EV6-generated external references to 
Memory or I/O space are physical addresses that are either successfully translated from a virtual address 
or produced by PALcode. On rare occasions, speculative execution may cause a reference to non-existent 
memory. Systems must range check all addresses and report those events to EV6. See section 6.3.8. 


EV6 does not cache IO space data; however, it merges both reads and writes and supplies a mask to 
indicate the bytes which are actually accessed. EV6 merges IO space LW and QW loads and stores into 
reads or writes of up to 32 or 64 bytes respectively. Systems may limit I/O QW writes to 32 bytes 
maximum by setting the TLASER_STIO_MODE bit in the CBOX csr. Byte and word operations to I/O 
space are never merged in EV6. All LDB, LDW, STB and STW instructions (PA<43> set) each generate a 
unique interface command. Finally, references of differing sizes are not merged. 
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3.1.1 I/O Ordering and Merge Rules 
EV6 will adhere to the following rules of order when executing LD and ST instructions to I/O space: 


1. Consecutive Loads from I/O space happen in the order specified by the programmer. 
2. Consecutive Stores to I/O space happen in the order specified by the programmer. 


3. Loads followed by stores to the same address (within the same 64 byte block) happen in the order 
specified by the programmer. 


4. Stores followed by Loads to the same address (within the same 64 byte block) happen in the order 
specified by the programmer. 


5. Loads followed by Stores (and vice versa) to different addresses (bits <43:6> not equal) may be 
UNORDERED and software cannot depend on one occurring first. 
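
The five rules can be condensed into a small predicate. This is a sketch of the rules as written, not of the hardware: "same address" is taken to mean equal bits <43:6> (the same 64-byte block), and the function names are hypothetical.

```python
def same_block(pa_a, pa_b):
    """True if two physical addresses agree in bits <43:6>."""
    return (pa_a >> 6) == (pa_b >> 6)


def io_order_maintained(first_op, first_pa, second_op, second_pa):
    """Return True if program order is kept between two I/O references.

    first_op and second_op are "LD" or "ST".
    """
    if first_op == second_op:
        return True  # rules 1 and 2: LD-LD and ST-ST are always ordered
    # Rules 3 and 4 (same 64-byte block) versus rule 5 (different blocks).
    return same_block(first_pa, second_pa)
```

For example, a store to an I/O address followed by a load from 0x40 bytes higher may be reordered, while a load from the same 64-byte block may not.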


The following matrix illustrates I/O merging rules in EV6. The intersection of two consecutive I/O 
operations contains the rule observed by EV6. Reads and writes will merge in ascending order only 
(obeys default ordering of a PCI device). Finally, merging can be terminated with a timer set to TBD 
CPU cycles. Collapsing (multiple I/O writes to the same location) does not occur in EV6. 


               BYTE/WORD    LONGWORD                QUADWORD
BYTE/WORD      No merge     No merge                No merge
LONGWORD       No merge     Merge up to 32 bytes    No merge
QUADWORD       No merge     No merge                Merge up to 64 bytes


The CBOX IPR mode bit that affects merging to I/O space is TLASER_STIO_MODE. When asserted, it 
limits merged stores to I/O space to 32 bytes. 


3.2 Cache Organization and Coherence 
The EV6 cache hierarchy has the following attributes: 


• I-stream data from both IO and Memory space may be cached in the Icache. Icache coherence is not 
  maintained by hardware - it must be maintained by software using the IMB instruction. 

• D-stream Memory space data may be cached in the Bcache and Dcache. EV6 ensures that the Dcache 
  contents are a subset of the Bcache. This allows Memory requests from other agents in the System to 
  be filtered using only a duplicate copy of the Bcache tags; external duplicate Dcache tags are not 
  required. 

• In Systems which use a Bcache, a Bcache duplicate tag store may be used to filter requests, but this is 
  not required. 

• System hardware is required to cooperate with EV6 to ensure coherence of the Bcache and Dcache. 


3.2.1 Cache Block States 
EV6 supports the following cache block states: 





State Name      Description
Invalid
Clean           This processor holds a copy of the block, but no other agent in the System holds a copy.
Clean/Shared    This processor and at least one other agent in the System may hold a copy of the block.
Dirty           This processor may write to the block and must write it to Memory after it’s evicted from
                the cache. No other agent in the System holds a copy of the block.
Dirty/Shared    The dirty block may be shared - this processor must write it back to Memory when it’s
                evicted. The block may not be written by this processor.





3.2.2 Cache Block State Transitions 


Cache block state transitions may be triggered by EV6-generated commands to the System or by System- 
generated commands to EV6. The latter are called probes. The diagram below shows the cache state 
transitions which are triggered by EV6’s actions. 





[Figure: cache block state transitions triggered by EV6 actions. Recoverable transition label: RdBlkMod, WCBH.] 


EV6 issues two types of reads to Memory space: RdBlk and RdBlkMod. EV6 will mark a cache block 
from a RdBlk command sent to the System either CLEAN or CLEAN/SHARED or even DIRTY 
depending on the response from the System during the cache fill. Systems can not transition a CLEAN 
block to DIRTY or DIRTY/SHARED with a probe command. 


EV6 will send a ChangeToDirty command to the System against a block not in the DIRTY state when it 
wants to write that block. There are two types of ChangeToDirty commands which EV6 may use based on 
the initial state of the cache block: CleanToDirty and SharedToDirty. Having two flavors of 
ChangeToDirty relieves Systems with a duplicate tag and address CAM from having to do a read-modify- 
write of the tag store in response to the ChangeToDirty command. Also, Systems need not generate a 
System bus invalidate in response to a CleanToDirty command. 


EV6 will send an InvalToDirty command to the System in response to the execution of a WCBH 
instruction (when enabled with the INVALTODIRTY_ENABLE csr) which does not hit on a Bcache block 
(if the block is valid, it defaults to the ChangeToDirty command rules). This will cause the block to 
transition to the DIRTY state, and other agents in the System should invalidate their copies of the block. 
There is no data movement associated with this command. Systems can return a success response by 
using the ChangeToDirty Success command encoding in SysDc<4:0>. See section 1.3.8 for details on 
data transfer commands that include responses to ChangeToDirty type commands. 


EV6 will send a WrVictimBlk command to the System when evicting a dirty or dirty/shared cache block, 
and may also be configured to send a CleanVictimBlk to the System when evicting a clean or shared 
block. 
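
The choice of command sent to the System for writes and evictions can be summarised in a sketch. It covers only the cases described in this section: the use of RdBlkMod for a write to an Invalid block is an assumption, the WCBH/InvalToDirty path is not modelled, and the function and parameter names are hypothetical.

```python
from enum import Enum


class State(Enum):
    INVALID = "Invalid"
    CLEAN = "Clean"
    CLEAN_SHARED = "Clean/Shared"
    DIRTY = "Dirty"
    DIRTY_SHARED = "Dirty/Shared"


def command_for_write(state):
    """Command EV6 sends before writing a block in the given state."""
    if state is State.DIRTY:
        return None                # already writable, nothing to send
    if state is State.CLEAN:
        return "CleanToDirty"
    if state in (State.CLEAN_SHARED, State.DIRTY_SHARED):
        return "SharedToDirty"
    return "RdBlkMod"              # assumption: a write miss reads with modify intent


def command_for_eviction(state, clean_victims_enabled=False):
    """Command EV6 sends when evicting a block in the given state."""
    if state in (State.DIRTY, State.DIRTY_SHARED):
        return "WrVictimBlk"
    if state in (State.CLEAN, State.CLEAN_SHARED) and clean_victims_enabled:
        return "CleanVictimBlk"
    return None
```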


The System sends probe commands to EV6 both to invoke data movement from EV6 to the System, and to 
change cache block states. Systems with duplicate tags can directly specify the cache block state transition 
which should occur, while Systems without duplicate tags specify a “transition type” which is combined 
with the results of the probe to determine the final state transition. 




3.2.3 System Knowledge of Bcache Contents 


EV6 will support Systems both with and without specialized hardware apparatus that track the state of 
the Bcache, and will take different actions on this basis. There are two principal differences related to the 
cache coherence protocol. 


Systems with duplicate Bcache tags or Memory resident directory maps only send probes to EV6 for 
cache blocks that are relevant (Bcache hit). These Systems know the final state of the cache 
block and can specify it. Ending status is not conditioned by the probe lookup in EV6. In 
contrast, Systems that do not have knowledge of internal Bcache status do not know the result of 
the probe in advance, so both data movement and cache block state transitions are conditioned by 
the results of the probe. 

Systems with duplicate Bcache tags or Memory resident directory maps will require CleanToDirty 
commands sent to the System port by EV6 to keep the external status tracking hardware up-to- 
date. This is not necessary in non-duplicate tag Systems. SharedToDirty commands to Shared 
blocks in either type of System must result in bus invalidates, and thus always appear on the 
System port. 


3.2.4 Dcache States & the Dcache Duplicate Tags


Each Dcache block contains an extra state bit beyond those required to support the cache protocol. The
modified bit, when set, indicates that the associated block should be written to the Bcache when it’s
evicted from the Dcache. The modified bit is set in two cases:


1. When a block is filled into the Dcache from Memory its modified bit is set, ensuring that it also gets
   filled into the Bcache.
2. When the processor writes to a dirty Dcache block the modified bit is set, indicating it should be
   written to the Bcache when evicted.


The DTAG array holds a physically indexed duplicate copy of the Dcache tags. Since the Dcache contains 
64KB and is virtually indexed, a given physical address could reside in any one of eight places in the 
cache. The Cbox uses the DTAGS in the following situations:


1. When the Mbox requests a Dcache fill, the Cbox uses the DTAGS to see if the Dcache already 
contains the requested physical address in another virtually indexed Dcache line. If so, the Cbox 
invalidates that cache line after first writing the data back to the Bcache if it was in the modified 
state. The Cbox also checks to see if the Dcache contains an address different from the requested 
address but which maps to the same Bcache line. If so, the Dcache line is evicted in order to keep the 
Dcache a subset of the Bcache.

2. When the Ibox requests an Icache fill, the Cbox uses the DTAGS to see if the Dcache contains the
requested physical address in the modified state. If so, the Cbox forces the line to be written back to
the Bcache before servicing the Icache fill request. The Cbox also checks to see if the Dcache contains 
an address different from the requested address but which maps to the same Bcache line. In this case 
the I-stream request will miss the Bcache, and the Cbox will service the request by launching a 
noncached fetch to the System port and will not put the I-stream block into the Bcache. This 
mechanism allows EV6 to use a cache resident “lock flag” for LDx_L/STx_C instructions. 

3. The Cbox uses the DTAGS to determine whether probe addresses are held in the Dcache. 


3.2.5 Memory Barrier (MB/WMB/TBfill flow) 


There is a mode bit called SYS_MB in the CBOX Control CSR which controls whether MB instructions
produce external System port transactions. EV6 will need to generate System port MB transactions in
Systems which allow READ and ChangeToDirty responses to reach EV6 ahead of System probes (out of
order with respect to the order in which transactions reached the System’s serialization point). An
external System MB command is required in Systems which do not compare incoming addresses and which
also allow refills to be seen by EV6 out of order with respect to the order in which commands reached
the System serialization point.

A counter exists in the CBOX that contains the number of pending uncommitted transactions. The counter
will increment for the following commands:


• RdBlk, RdBlkMod, RdBlkI
• valid RdBlkSpec, valid RdBlkModSpec, valid RdBlkSpecI
• RdBlkVic, RdBlkModVic, RdBlkVicI
• CleanToDirty, SharedToDirty, STChangeToDirty, InvalToDirty
• FetchBlk, valid FetchBlkSpec, Evict, RdByte, RdLw, RdQw


The counter is decremented with the C (commit) bit in the Probe and SysDc commands described in 
Section 3.3.7. Systems can send the C bit in the SysDc fill-response to the commands which increment 
the counter or on the last probe seen by that command when it reached the system serialization point. 
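
The bookkeeping described above can be sketched as a small model. This is an illustration only: the class and method names are hypothetical, the command names are taken from the list above, and reading "valid" for a speculative command as "RV bit set" is an assumption based on section 3.3.6.

```python
# Commands that increment the uncommitted-transaction counter (from the list above).
COUNTED = {
    "RdBlk", "RdBlkMod", "RdBlkI",
    "RdBlkSpec", "RdBlkModSpec", "RdBlkSpecI",
    "RdBlkVic", "RdBlkModVic", "RdBlkVicI",
    "CleanToDirty", "SharedToDirty", "STChangeToDirty", "InvalToDirty",
    "FetchBlk", "FetchBlkSpec", "Evict", "RdByte", "RdLw", "RdQw",
}
# Speculative commands only count when they are valid.
SPECULATIVE = {"RdBlkSpec", "RdBlkModSpec", "RdBlkSpecI", "FetchBlkSpec"}


class UncommittedCounter:
    """Toy model of the CBOX counter of pending uncommitted transactions."""

    def __init__(self) -> None:
        self.count = 0

    def on_command(self, name: str, rv: bool = True) -> None:
        """Account for a command sent to the System port."""
        if name not in COUNTED:
            return
        if name in SPECULATIVE and not rv:
            return  # assumption: an invalid speculative read is a NOP
        self.count += 1

    def on_commit(self) -> None:
        """Account for a C (commit) bit received from the System."""
        if self.count == 0:
            raise RuntimeError("commit received with no uncommitted transaction")
        self.count -= 1
```

In this model a memory barrier may proceed only once `count` has returned to zero and the remaining conditions of the MB flow are met.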


When an MB instruction is fetched, it stalls in the map stage of the pipeline. This also stalls all
instructions after the MB until:


1. If SYS_MB is clear, the EV6 CBOX waits for the integer issue queue to empty and performs the
   following actions:
   • Sends all pending miss address file (MAF) and WRIO entries to the System port
   • Monitors a 4-bit counter of outstanding uncommitted events. When the counter decrements
     from one to zero, CBOX marks the youngest probe queue entry
   • Waits until the miss address file contains no more dstream references, and the store queue, load
     queue and I/O write buffers are empty
   When all of the above have occurred and a probe response has been sent to the System for the marked
   probe queue entry, instruction execution continues with the instruction after the MB.


2. If SYS_MB is set, the EV6 CBOX performs the following actions:
   • Sends all pending MAF entries to the System port
   • Sends the MB command to the System port
   • Waits until the MB command is acknowledged and marks the youngest entry in the probe queue
   • Waits until the miss address file contains no more dstream references, and the store queue, load
     queue and I/O write buffers are empty
   When all of the above have occurred and a probe response has been sent to the System for the marked
   probe queue entry, instruction execution continues with the instruction after the MB.


Write Memory Barriers (WMB’s) are issued into the MBOX store-queue, wait until they are retired and
become writeable, and when the writeable pointer reaches the WMB, the MBOX freezes the writeable
pointer and informs the CBOX. The CBOX closes the write buffer and responds based on SYS_MB.


If SYS_MB is clear, the EV6 CBOX performs the following actions:
   • Marks the youngest entry in the probe queue
When a probe response has been sent to the System for the marked probe queue entry, the MBOX
unfreezes and advances the writeable pointer.


If SYS_MB is set, the EV6 CBOX performs the following actions:
   • Sends the MB command to the System port
   • Waits until the MB command is acknowledged and marks the youngest entry in the probe queue
When a probe response has been sent to the System for the marked probe queue entry, the MBOX
unfreezes and advances the writeable pointer.


Loads to a virtual page table entry (HW_LD/VPTE) are processed by EV6 so as to avoid litmus test
problems associated with the ordering of memory accesses from another processor against the load of a
page table entry and the subsequent virtual-mode load from this processor. Consider the following:


Pi                  Pj
Wr Data             LD/ST Data
MB                  <TB Miss>
Wr PTE              LD-PTE
                    <wr TB>
                    LD/ST (restart)


Pj must get the updated Data if it got the updated PTE. Also consider the related case:


Pi                  Pj
Wr Data             I-stream read Data
MB                  <TB Miss>
Wr PTE              LD-PTE
                    <wr TB>
                    I-stream read (restart) - will miss the Icache





In this case the Data could be cached in the Bcache; Pj should fetch Data if it is using PTE. EV6
processes dstream loads to the page table entry by injecting, in hardware, some memory barrier processing
between the access of the page table entry and any subsequent load or store. This is accomplished by the
following mechanism:

• Integer queue issues a HW_LD/VPTE.

• Integer queue issues a HW_MTPR DTB_PTE0 which is data-dependent on the
  HW_LD/VPTE and is required in order to fill the DTB’s. The HW_MTPR, when enqueued,
  sets IPR scoreboard bits <4> and <0>.

• On issue of HW_MTPR DTB_PTE0, IBOX signals CBOX that a HW_LD/VPTE has been
  processed and causes CBOX to begin “MB” processing. IBOX prevents issue of any
  subsequent memory operations by not clearing the IPR scoreboard bit <0> (one of the
  scoreboard bits associated with the HW_MTPR DTB_PTE0).

• When “MB” processing is complete (one of the above sequences, depending on SYS_MB),
  CBOX signals IBOX to clear IPR scoreboard bit <0>.


EV6 processes TB miss fills to the page table entry via a similar mechanism:

• Integer queue issues a HW_LD/VPTE.

• Integer queue issues a HW_MTPR ITB_PTE which is data-dependent on the HW_LD/VPTE
  and is required in order to fill the ITB. The HW_MTPR, when enqueued, sets IPR scoreboard
  bits <4> and <0>.

• On issue of HW_MTPR ITB_PTE, IBOX signals CBOX that a HW_LD/VPTE has been
  processed and causes CBOX to begin “MB” processing. The MBOX stalls off any IBOX
  fetching from the time that the HW_LD/VPTE finishes until the probe queue is drained.

• When “MB” processing is complete (one of the above sequences, depending on SYS_MB),
  CBOX signals IBOX to clear IPR scoreboard bit <0>. In addition, the MBOX signals the
  IBOX to begin fetching.


3.2.6 Load/Locked and Store/Conditional 


EV6 doesn’t contain a dedicated lock register, nor are System components required to do so. When a
LDx_L instruction executes, data is accessed from the Dcache or Bcache. If there is a cache miss, data is
accessed from memory with a RdBlk command. When the store-conditional executes, it is allowed to
succeed if its associated cache line is still present in the Dcache and can be made writeable; otherwise it
fails. This works because, if another agent in the System wrote to the cache line between the load-lock
and the store-conditional, the cache line would have been invalidated. There are a number of further
complications, however:

• A load-lock and its matching store-conditional must issue in program order.
  The stWait logic in the IQ is used to ensure that a store-conditional always issues
  after an older load-lock. The stWait logic treats load-locks like stores, and
  store-conditionals are always loaded into the IQ with their associated stWait bit set.

• I-stream references can’t evict the locked cache line.
  If an Icache fill request misses the Bcache but maps to the same Bcache line as an
  address which is held in the Dcache, then the I-stream request is sent to the System
  port as a non-cached fetch, and the I-stream line is not allocated into the Bcache.

• Loads or stores that are older than the load-lock but issue after it can’t evict the
  locked cache line.
  The Mbox recognizes this case and invokes a replay trap on the incoming load or
  store, which also aborts the load-lock. These instructions issue in program order the
  next time down the pipe.

• If the instruction fetcher predicts that a branch between a load-lock and a
  store-conditional will be taken, and the branch is not taken, then a load or store executed
  on this mispredicted path can’t evict the locked cache line.
  There is a bit in the instruction fetcher which is set on a load-lock and cleared on
  any other Memory reference instruction. When this bit is set the branch predictor
  forces all branches to be predicted as fall through.

• Loads or stores which are newer than the store-conditional can’t evict the locked line.
  The Ibox ensures that a store-conditional issues before any newer load or store by
  placing the store-conditional into the IQ and stalling all subsequent instructions in
  the map stage of the pipe until the IQ is empty. This allows the Mbox to prevent
  newer loads and stores from evicting the cache line associated with the
  store-conditional.

• If two store-conditionals execute without an intervening load-lock, the second
  store-conditional must always fail. (Store-conditionals to I/O will ALWAYS succeed.)
  The register map logic contains a bit which is set by load-locks and cleared by
  store-conditionals. If the bit is clear when a store-conditional instruction is mapped,
  then the store-conditional is forced to fail. The mapper updates the value of the bit as
  appropriate when pipeline aborts occur.

• There must be no live-lock conditions in multiprocessor Systems.
  If a store-conditional misses the Dcache then no System port transaction is launched,
  and the store-conditional fails.
  If the store-conditional hits a block which isn’t dirty, then a ChangeToDirty is
  launched only after the store-conditional instruction retires and all older store queue
  entries are in the writable state. This ensures that once the ChangeToDirty is
  launched on behalf of the store-conditional, the store-conditional will be executed
  to completion if the ChangeToDirty passes.
  If the ChangeToDirty passes, the store-conditional enters the writable state, and the
  Mbox locks down the Dcache line and does not release it until the store-conditional’s
  data is transferred into the Dcache.
  If the Cbox launches a CleanToDirty command for the locked block to the System
  port and another agent reads the block before the CleanToDirty hits the serialization
  point in the System, then the System will cause the CleanToDirty to fail.
  In this case EV6 will launch a SharedToDirty command to the System against the
  locked block. This ensures that other agents do not cause the store-conditional to fail
  just by reading the locked block.
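
Two of the rules above (the mapper bit and the Dcache presence check) can be put together in a small software model. This is a sketch for illustration only; the class name, method names and boolean parameters are hypothetical, and the model ignores pipeline aborts, replay traps and the ChangeToDirty retry path.

```python
class LockFlagModel:
    """Toy model of the load-lock / store-conditional success rule."""

    def __init__(self) -> None:
        # Mapper bit: set by load-locks, cleared by store-conditionals.
        self.lock_bit = False

    def load_locked(self) -> None:
        self.lock_bit = True

    def store_conditional(
        self,
        line_in_dcache: bool,
        can_be_made_writable: bool,
        io_space: bool = False,
    ) -> bool:
        """Return True if the store-conditional succeeds."""
        armed = self.lock_bit
        self.lock_bit = False  # every store-conditional clears the bit
        if io_space:
            return True  # store-conditionals to I/O always succeed
        # A second store-conditional without a new load-lock finds the bit
        # clear and is forced to fail. Otherwise the line must still be in
        # the Dcache and must be able to become writable.
        return armed and line_in_dcache and can_be_made_writable
```

For example, after one `load_locked()` the first `store_conditional(True, True)` succeeds in this model and an immediately following second one fails.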
3.3 System Port


The System port is EV6’s connection to either a local Memory/IO controller or a shared multiprocessor 
system controller. The System port consists of two uni-directional address and command busses 
(SysAddIn<14:0>, SysAddOut<14:0> ), a bi-directional data bus (SysData<63:0>, SysCheck<7:0>), 
single-ended uni-directional clocks, and a few control pins. All SysAdd and SysData signals are driven 
from EV6 with low assertion levels. Systems must receive and drive low asserted signals. 


3.3.1 System Port Pins 





Pin Name             Type  Count  Description

SysAddIn<14:0>_L     I     15     time-muxed Command/Address/ID/Ack System to EV6 bus
SysFillValid_L       I     1      validation for fill given in previous SysDc command
SysAddInClk_L        I     2      single-ended forwarded clock from System for above signals
SysAddOut<14:0>_L    O     15     time-muxed Command/Address/ID/Mask EV6 to System bus
SysAddOutClk_L       O     2      single-ended forwarded clock output for above signals
SysData<63:0>_L      B     64     data bus for Memory and IO data
SysCheck<7:0>_L      B     8      QW ECC check bits for SysData
SysDataInClk_L       I     8      System generated clocks for clock forwarded SysData in
SysDataOutClk_L      O     8      EV6 generated clocks for clock forwarded SysData out
SysDataInValid_L     I            Marks a valid data cycle for data transfers to EV6 when asserted
SysDataOutValid_L    I            Marks a valid data cycle for data transfers from EV6 when asserted


3.3.1.1 Legend: I = input, O = output, B = Bi-directional


3.3.2 EV6 to System Address/Command Format 


Command, Address, ID and Mask are sent in four consecutive cycles. EV6 can be configured to send two
different combinations of PA bits in the four cycle command, the goal being to give the System the PA
bits that let it do Memory bank select and RAS address drive as fast as possible. The ID is the miss
address file (MAF), victim buffer or IO write buffer number associated with the command. The mask
indicates the accessed bytes, longwords or quadwords for an IO space reference. Commands with Victims
are sent as an atomic pair of standard format commands, if the CBOX IPR be_rdvictim is set.


3.3.2.1 Bank Interleave On Cache Block Boundary 


|         | SysAddOut<14:2>                     | SysAddOut<1> | SysAddOut<0> |
| Cycle 1 | M1 | Command<4:0> | PA<34:28>      | PA<36>       | PA<38>       |

3.3.2.2 Page Mode Hit


|         | SysAddOut<14:2>                     | SysAddOut<1> | SysAddOut<0> |
| Cycle 1 | M1 | Command<4:0> | PA<31:25>







Field definitions are:


SysAddOut Field   Definition

M1                reports a miss to the System for the oldest probe when = 1. Has no meaning when = 0.

Command<4:0>      the five bit command field

SysAddOut<1:0>    is only required for Systems with greater than 32 Gbyte (up to a maximum of 8
                  Terabyte) memories. This will allow cost focused systems to use a 13 bit
                  command/address field.

M2                reports a miss to the System for the oldest probe when = 1; additionally it is asserted
                  for Invalidates or set shared commands that have no data movement. M2 has no
                  meaning when = 0. Assertion of both M1 and M2 will not occur. (Reporting probe
                  results is timing critical so when a result is known, EV6 will take the earliest
                  opportunity to send an M signal to the system. M bit assertion can occur either in a
                  valid command or a NZNOP.)

ID<2:0>           the MAF, VDB, or Write I/O buffer id number associated with the command

RV                validates this command; in (optional) speculative read mode RV = 1 validates the
                  command and RV = 0 is a NOP. RV is a 1 for all non-speculative commands.

Mask<7:0>         the byte, LW or QW mask field for corresponding I/O commands

CH                cache hit bit that is asserted along with M2 when probes with no data movement hit
                  in the D or B cache. A probe with no data movement can be an Invalidate or a
                  ReadIfDirty that hits on a valid but clean or shared block.





3.3.3. SysAdd Commands Generated by EV6 





Command            Command<4:0>  Function

NOP                00000         EV6 drives this on idle cycles
ProbeResponse      00001         Returns probe status and Victim Buffer number holding the
                                 requested cache block.
NZNOP              00010         Non_zero NOP, helps parse command packet
VDBFlushRequest    00011         Victim Data Buffer Flush Request. EV6 sends this command to the
                                 System when an internally generated reference hits a Bcache victim
                                 or Probe in the VDB. The System should flush VDB entries
                                 associated with all probes and WrVictimBlks which occurred before
                                 this command.
MB                 00111         Indicates a MB was issued, optional when SYS_MB is set

RdBlk              10000         Memory Read
RdBlkMod           10001         Memory Read, modify intent
RdBlkI             10010         Memory Read for I-stream, optional
FetchBlk           10011         Memory Uncached RdBlk

RdBlkSpec          10100         Memory Read speculative, optional
RdBlkModSpec       10101         Memory Read, speculative, modify intent, optional
RdBlkSpecI         10110         Memory Read for I-stream, optional
FetchBlkSpec       10111         Memory Uncached RdBlk, speculative

RdBlkVic           11000         Memory Read with a victim - optional
RdBlkModVic        11001         Memory Read, modify intent, victim - optional
RdBlkVicI          11010         Memory Read for I-stream with a victim - optional

WrVictimBlk        00100         Writeback of Dirty Block
CleanVictimBlk     00101         Address of a Clean Victim, optional mode used in directory
                                 Systems
Evict              00110         Duplicate Tag Invalidate, optional

RdBytes            01000         IO Read, Byte mask
RdLWs              01001         IO Read, LW mask
RdQWs              01010         IO Read, QW mask
WrBytes            01100         IO Write, Byte mask
WrLWs              01101         IO Write, LW mask
WrQWs              01110         IO Write, QW mask

CleanToDirty       11100         Sets a block dirty that was previously Clean, optional for duplicate
                                 Tags
SharedToDirty      11101         Sets a block dirty that was previously Shared, optional for MP
                                 Systems
STCChangeToDirty   11110         Sets a block dirty that was previously Clean or Shared for a STx_C,
                                 optional for MP Systems
InvalToDirty       11111         Acts like a RdBlkMod without the fill cycles, optional for MP
                                 Systems. InvalToDirty has a victim - optional
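
For System-side tooling such as a bus monitor or simulator, the encodings in the table above can be held in a lookup table. The sketch below is illustrative only: `SYSADDOUT_COMMANDS` and `decode_command` are hypothetical names, and labeling the codes that do not appear in the table as "unlisted" is an assumption.

```python
# Command<4:0> encodings copied from the table above.
SYSADDOUT_COMMANDS = {
    0b00000: "NOP",
    0b00001: "ProbeResponse",
    0b00010: "NZNOP",
    0b00011: "VDBFlushRequest",
    0b00100: "WrVictimBlk",
    0b00101: "CleanVictimBlk",
    0b00110: "Evict",
    0b00111: "MB",
    0b01000: "RdBytes",
    0b01001: "RdLWs",
    0b01010: "RdQWs",
    0b01100: "WrBytes",
    0b01101: "WrLWs",
    0b01110: "WrQWs",
    0b10000: "RdBlk",
    0b10001: "RdBlkMod",
    0b10010: "RdBlkI",
    0b10011: "FetchBlk",
    0b10100: "RdBlkSpec",
    0b10101: "RdBlkModSpec",
    0b10110: "RdBlkSpecI",
    0b10111: "FetchBlkSpec",
    0b11000: "RdBlkVic",
    0b11001: "RdBlkModVic",
    0b11010: "RdBlkVicI",
    0b11100: "CleanToDirty",
    0b11101: "SharedToDirty",
    0b11110: "STCChangeToDirty",
    0b11111: "InvalToDirty",
}


def decode_command(code: int) -> str:
    """Map a 5-bit Command<4:0> value to its name."""
    if not 0 <= code <= 0b11111:
        raise ValueError("Command<4:0> is a five bit field")
    # Codes absent from the table are reported as "unlisted" (an assumption).
    return SYSADDOUT_COMMANDS.get(code, "unlisted")
```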


Systems can optionally enable RdBlkVic and RdBlkModVic commands. In this mode the RdBlkxVic
command cycles are always followed immediately by the WrVictimBlk commands. Also, when
CleanVictimBlk commands are enabled they immediately follow RdBlkVic and RdBlkModVic
commands. Speculative reads in RdBlk victim mode will not create victims; this is useful for TurboLaser
and TurboLaser follow-ons.

3.3.4 Probe Response Transfers


EV6 responds to System probes that did not miss with a four cycle transfer on the SysAddOut bus. The
format of the probe response is shown below:


Command<4:0>   Identifies transfer as probe response
DM             Indicates that data movement should occur (copy of Probe valid bit)
VS             Write Victim Sent bit
VDB<2:0>       VDB (Victim Data Buffer) entry containing the requested cache block.
               This field is valid when either the DM bit or the VS bit = 1
MS             MAF address sent
MAF<2:0>       MAF entry which matched against the probe address
Status<1:0>    Result of probe:
               00 HitClean
               01 HitShared
               10 HitDirty
               11 HitSharedDirty


The System retrieves data from EV6 for probes that requested a cache block by using the SysDC wires. 
Probes which respond with M1 or M2 set will never be reported to the System in a Probe Response 
command. 


3.3.5 SysAck & System Port Flow Control 


Flow control of EV6-generated System port commands is done via the “A” bit, which is driven by the
System, and a counter internal to EV6. EV6 increments its “command outstanding” counter every time it
sends a command to the System. It increments this counter by two for RdBlkVic commands. EV6
decrements the counter by one each time the System asserts “A” (SysAddIn<14> in cycle 4 of the probe
command or cycle 2 of the SysDc command). EV6 stops sending new commands when the counter hits
the maximum count specified by the sysbus_ack_limit field in the CBOX IPR. EV6 will not send a
RdBlkxVic command if the counter is equal to one less than the maximum outstanding count. There is no
mechanism for the System to reject a command that has been sent. ProbeResponse, VDBFlushReq, NOP,
NZNOP and a RdBlkxSpec with a clear RV bit will not increment the “command outstanding” counter
and will therefore not require an “ACK” from the System. Systems must provide adequate resources for
responses to all probes sent to EV6. Additionally, there is a CBOX IPR that when set will not increment
the outstanding command counter for RdBlkVic, RdBlkModVic and RdBlkVicI commands. This is the
“rdvic_ack_inhibit” bit.
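
The counting rules above can be sketched as a small model. It is an illustration under stated assumptions, not EV6 behavior in detail: the class and method names are hypothetical, `ack_limit` stands in for the sysbus_ack_limit field, and the rdvic_ack_inhibit bit is deliberately not modeled.

```python
# Commands that never touch the counter.
NOT_COUNTED = {"ProbeResponse", "VDBFlushRequest", "NOP", "NZNOP"}
# Victim reads count twice (the read plus the victim that follows it).
VICTIM_READS = {"RdBlkVic", "RdBlkModVic", "RdBlkVicI"}
SPECULATIVE = {"RdBlkSpec", "RdBlkModSpec", "RdBlkSpecI", "FetchBlkSpec"}


class CommandOutstandingCounter:
    """Toy model of the EV6 "command outstanding" counter."""

    def __init__(self, ack_limit: int) -> None:
        self.ack_limit = ack_limit
        self.count = 0

    @staticmethod
    def cost(name: str, rv: bool = True) -> int:
        """How much a command adds to the counter."""
        if name in NOT_COUNTED or (name in SPECULATIVE and not rv):
            return 0
        return 2 if name in VICTIM_READS else 1

    def can_send(self, name: str, rv: bool = True) -> bool:
        # Ordinary commands stop at the limit; victim reads already stop
        # when the counter is one less than the limit.
        return self.count + self.cost(name, rv) <= self.ack_limit

    def send(self, name: str, rv: bool = True) -> None:
        if not self.can_send(name, rv):
            raise RuntimeError("command would exceed the ack limit")
        self.count += self.cost(name, rv)

    def ack(self) -> None:
        """The System asserted the A bit."""
        if self.count == 0:
            raise RuntimeError("A bit asserted with nothing outstanding")
        self.count -= 1
```

A System model would call `ack()` once for each command it has accepted, and twice in total for a victim read, since that command added two to the counter.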


3.3.6 SysReadValid and Speculative Reads 


Systems can configure EV6 to send Memory space RdBlkSpec and RdBlkModSpec commands before EV6
has determined that the read has missed the Bcache. SysAddOut<14> of the fourth command address
cycle contains the RV bit for that transaction. When configured for speculative reads, RV=0 indicates a
NOP for that command and RV=1 validates that command. Systems not opting for speculative reads will
always have RV=1. A RdBlkSpec or RdBlkModSpec with a clear RV bit will not increment the
outstanding command counter and therefore Systems must not send an ACK to EV6 for these commands.


3.3.7 SysAdd Commands Generated by the System 


The commands driven by the System to EV6 are generically called probes and data movement commands. 
There are two formats for the SysAddIn bus that specify probes and data movement. Probes are always 4 
cycle commands that also contain a field to include a valid SysDc command. The format of the four cycle 
command is shown below. Note that SysAddIn<1:0> are optional and are used for Memory designs 
greater than 32 Gbytes. The position of the address bits matches the selected format of the SysAddOut 
bus. The example below shows the bank interleave format. 


| Cycle 3 | 0 | SysDc<4:0> | RVB | RPB | A | ID<3:0> | PA<40> | PA<42> |














SysAddIn Field   Description

Probe<4:0>       Probe Type and Next Tag State (See table below)
SysDc<4:0>       Controls data movement in and out of EV6. See section 3.3.7 for details
RVB              Clears Victim Buffer or WRIO buffer valid bit specified in ID<3:0>
RPB              Clears Probe Buffer valid bit specified in ID<2:0>
A                Command Ack bit that decrements EV6 command outstanding counter
ID<3:0>          Identifies VDB number or WRIO buffer number, <3> is asserted for
                 WRIO only
C (COMMIT)       Commit bit that decrements the uncommitted event counter used for
                 Memory Barrier acknowledge.


The command field of a probe has two fields: the first sets the data movement, the second determines
the next cache block state:


Probe<4:3>   Data Movement Function

00           Nop
01           Read if Hit, supply data to System if block is valid
10           Read if Dirty, supply data to System if block is valid/dirty
11           Read Anyway, supply data to System at index of probe


Probe<2:0>   Next Tag State

000          Nop
001          Clean
010          Clean/Shared
011          Transition3: Clean->Clean/Shared, Dirty->Invalid, Dirty/Shared->Clean/Shared
100          Dirty/Shared
101          Invalid
110          Transition1: Clean->Clean/Shared, Dirty->Dirty/Shared
111          Transition2: Clean->Clean/Shared, Dirty->Clean/Shared


Next Tag State notes:

Transition1 is useful in non-duplicate tag Systems that do not update Memory on RdBlk hits to a dirty
block.

Transition2 is useful in non-duplicate tag Systems that update Memory on RdBlk hits to a dirty block.

Transition3 is useful in non-duplicate tag Systems that want to give writeable status to the reader and do
not know if the block is clean or dirty.
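
The two probe fields and the transition types above can be sketched as a decoder. This is an illustration only: the names are hypothetical, the sketch assumes the probe hit a valid block, and leaving a state unchanged when a transition type does not list it is an assumption.

```python
DATA_MOVEMENT = {
    0b00: "Nop",
    0b01: "ReadIfHit",
    0b10: "ReadIfDirty",
    0b11: "ReadAnyway",
}
# Next-tag-state codes that name the final state directly.
FIXED_STATES = {
    0b001: "Clean",
    0b010: "Clean/Shared",
    0b100: "Dirty/Shared",
    0b101: "Invalid",
}
# Transition types: final state depends on the state found by the probe.
TRANSITIONS = {
    0b110: {"Clean": "Clean/Shared", "Dirty": "Dirty/Shared"},   # Transition1
    0b111: {"Clean": "Clean/Shared", "Dirty": "Clean/Shared"},   # Transition2
    0b011: {"Clean": "Clean/Shared", "Dirty": "Invalid",
            "Dirty/Shared": "Clean/Shared"},                     # Transition3
}


def decode_probe(probe: int):
    """Split Probe<4:0> into (data movement name, next tag state code)."""
    if not 0 <= probe <= 0b11111:
        raise ValueError("Probe<4:0> is a five bit field")
    return DATA_MOVEMENT[(probe >> 3) & 0b11], probe & 0b111


def next_tag_state(next_code: int, current: str) -> str:
    """Final state of a block that the probe hit in state `current`."""
    if next_code == 0b000:
        return current  # Nop: state unchanged
    if next_code in FIXED_STATES:
        return FIXED_STATES[next_code]
    # Assumption: states not listed for a transition type stay unchanged.
    return TRANSITIONS[next_code].get(current, current)
```

A System with duplicate tags would use the fixed codes, since it already knows the outcome; a System without them would use one of the three transition types.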


EV6 holds pending probe commands in an 8-entry deep probe queue. The System must keep track of how
many probes were sent and not overrun EV6’s queue. Probes are removed from the internal probe queue
when the probe response is sent.

3.3.8 Two Cycle Commands For Data Transfers


As mentioned above, there are two formats for the SysAddIn bus. The second format is a two cycle
transfer for data movement commands. The SysDc command field contained within the two cycle format
controls movement of data in and out of EV6, success/failure for ChangeToDirty and MB commands, and
error conditions. The data transfers must begin in the first SysData cycle which occurs 9 CPU cycles after
the start of the SysAdd cycle in which the ID command was received. The pattern of data is controlled by
the SysDataInValid and SysDataOutValid signals. These signals validate each cycle of data transfer and
are used to put gaps in the data pattern. The timing is described in Section 3.3.9.1. The format of the two
cycle SysAddIn transfer is shown below:


[Two cycle SysAddIn format table. Only the column headings SysAddIn<14:2>, SysAddIn<1> and
SysAddIn<0>, and the C bit in Cycle 2, are legible in the source.]


SysDc Command                  SysDc<4:0>  Description

Nop                            00000       Nop, SysData ignored by EV6
ReadDataError                  00001       Data returned for Reads, System drives SysData bus, I/O or
                                           Memory NXM
ChangeToDirtySuccess           00100       no data, SysData ignored by EV6, also used for InvalToDirty
                                           response
ChangeToDirtyFail              00101       no data, SysData ignored by EV6, also used for Evict response
MBDone                         00110       Memory barrier completed
ReleaseBuffer                  00111       Command to alert EV6 that RVB, RPB and the ID field are valid.
ReadData (System Wrap)         100xx       Data returned for Reads, System drives SysData. Systems define
                                           wrap order using SysDc<1:0>. See section 3.3.9.6 on Data
                                           Wrapping.
ReadDataDirty                  101xx       Data returned for Readx and ReadxMods, ending Tag status is
                                           Dirty; as above, System defines wrap order.
ReadDataShared (System Wrap)   110xx       Data returned for Reads, System drives Data, Tag marked Shared.
                                           Systems define wrap order using SysDc<1:0>.
ReadDataShared/Dirty           111xx       Data returned for ReadBlk, ending Tag status is Shared/Dirty;
                                           as above, System defines wrap order.
WriteData                      010xx       Data sent for EV6 Writes or System Probe. EV6 drives SysData
                                           bus. Lower two bits of the command specify the quadword address
                                           around which EV6 should wrap the data.
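
The SysDc encodings above split into commands with a fixed five bit code and commands whose lower two bits carry wrap information. A decoder along those lines might look like the sketch below; the function name is hypothetical, and reporting codes that are not in the table as "unlisted" is an assumption.

```python
from typing import Optional, Tuple

# Commands with a fully specified five bit encoding.
FIXED_SYSDC = {
    0b00000: "Nop",
    0b00001: "ReadDataError",
    0b00100: "ChangeToDirtySuccess",
    0b00101: "ChangeToDirtyFail",
    0b00110: "MBDone",
    0b00111: "ReleaseBuffer",
}
# Commands identified by SysDc<4:2>; SysDc<1:0> selects the wrap order.
WRAPPED_SYSDC = {
    0b010: "WriteData",
    0b100: "ReadData",
    0b101: "ReadDataDirty",
    0b110: "ReadDataShared",
    0b111: "ReadDataShared/Dirty",
}


def decode_sysdc(code: int) -> Tuple[str, Optional[int]]:
    """Return (command name, wrap field or None) for a SysDc<4:0> value."""
    if not 0 <= code <= 0b11111:
        raise ValueError("SysDc<4:0> is a five bit field")
    if code in FIXED_SYSDC:
        return FIXED_SYSDC[code], None
    top = code >> 2
    if top in WRAPPED_SYSDC:
        return WRAPPED_SYSDC[top], code & 0b11
    return "unlisted", None
```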


There are 8 victim buffers in EV6. These victim buffers are used for both victims (fills that are replacing
dirty cache blocks) and for System probes that require data movement. The CleanVictim command
(optional) will also assign a victim data buffer. Each buffer has two valid bits that denote whether the
buffer is valid for a Victim, valid for a Probe, or valid for both. Probe commands that address
match a VAF entry with an asserted Probe valid bit (P) will stall the EV6 probe queue. No probe
responses will be returned until the P bit is clear. RVB (Release Victim Buffer), when asserted, will
clear the Victim valid bit on the Victim Data Buffer (VDB) specified in the ID field. The RVB bit will also
clear the WRIO buffer when Systems move data on I/O writes. RPB (Release Probe Buffer), when asserted,
will clear the Probe valid bit on the Victim Data Buffer (VDB) specified in the ID field. Read data
commands and victim write commands use IDs 0-7 while IDs 8-11 are used to address the 4 IO write
buffers.

“A” in the first cycle is the command acknowledge used to decrement the EV6 “command outstanding”
counter, but is not necessarily related to the current SysDc command.


Probe commands can have a combined SysDc command along with MBDone. In that event, the probe is
considered ahead of the SysDc command. In particular, if the SysDc command allows EV6 to retire an
instruction before an MB, or allows EV6 to retire an MB itself (SysDc is MBDone), that MB will not
complete until the probe is executed.

Systems must assert the appropriate SysDc command for correct ending Tag status. Systems may elect to
return a dirty block of data to a RdBlk command. Systems that return a clean block in response to a
RdBlkMod may cause a livelock; it is, therefore, not recommended. Finally, Systems may not cause a
STx_C failure on any other processor in the system when returning a dirty block in response to a RdBlk.


3.3.9 Data Movement In and Out of EV6 


There are two modes of operation that pertain to data movement in and out of EV6. These modes are
selected with the CBOX csr called FAST_MODE_DISABLE. Fast data mode allows movement
of data from EV6 to bypass protocol and achieve the lowest possible latency for probe data, write victims
and I/O writes. Rules and conditions for each mode are as follows:


3.3.9.1 Fast Data Mode 


EV6 is the default driver of the bi-directional SysData bus. As EV6 is processing WrVictim, Probe
Response and WRIO commands to the system, data relative to this command is made available at the
clock forwarded pin bus. SysDc commands that turn the SysData bus around may interrupt the successful
completion of the ‘fast’ transfer. Systems are responsible to detect and replay all interrupted ‘fast’
transfers. There are no gaps in a ‘fast’ transfer and no wrapping (the first cycle contains QW0 addressed
by 5:3 = 000#2).


Finally, systems must release victim buffers, probe buffers and WRIO buffers by sending a SysDc 
command with the appropriate RVB/RPB bit for both successful ‘fast’ transfers and for transfers that have 
been replayed. Fast transfers have two components, (1) the SysAddOut command with the probe 
response, WrVictim, or Wr(I/O) and (2) data. The command precedes data by, at least, one Framing 
Clock. The matrix below shows the number of Framing Clocks between SysAddOut and SysData for all 
System clock ratios (clock forwarded bit times) and Framing clock multiples. 






: CLOCK FGRWARD BIT TIME (System Clock Ratio 
Bit Time/Framing Clock | 25 | 30 | 35 | 40 
4 ae eae 


| 4 
See oe: RE ae 








| 2.0 | 


The timing diagram below show a simple example of a ‘fast’ transfer. This is an example of a system 
clock ratio of 1.5 and 4 bit times/Framing clock. 


SysDc [~~ X_ProbXRes¥onse XX XX XX XXX) 


complete until the probe is executed. 

Systems must assert appropriate SysDc command for correct ending Tag status. Systems may elect to 
| 
| 
SysData [—_X_X_X_X__Xia-_Xbt_X2_X3_XHA_K eta XXX 
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Movement of Data into EV6 involves careful timing to turnaround the SysData bus that is being driven by 
EV6. EV6 will respond to the SysDc command that always precedes the movement of data into EV6. 
Both the SysDc command and the first cycle of data are sent on System Framing clock boundaries(rising 
or falling edges). The total minimum number of Framing clocks between SysDc and data can be 
calculated as follows: 

1. The fixed minimum delay in EV6 between the receipt of SysDc and the capture of the first piece of 
   data is 9 processor cycles. So, Fixed Delay (FD) = (EV6 Cycle Time * 9). 

2. Settle Time is the electrical bus settle time requirement that depends upon the maximum distance 
   between EV6 and the furthest data chip. It is the round trip delay on the bi-directional data bus and 
   can be calculated as: Settle Time (ST) = (2 * max distance (in)) * 200 psec/in. 

3. Clock skew between the EV6 Framing clock and the System Framing clock is a factor in the 
   turnaround time. Total Skew (TSkew) = EV6 skew (4.0 nsec) + System skew (?). 

4. To calculate the total number of Framing clocks between SysDc and data, take the sum of the three 
   delay components above, divide that by the period of the Framing clock, and round UP to the next 
   half or whole Framing clock. The equation is as follows: 


# Framing Clocks = (FD + ST + TSkew) / Framing Clock period 
Example: CPU Cycle time = 2 nsec 
         Framing Clock = 12 nsec 
         Max Distance = 10 inches 
         Total Skew = 4.5 nsec 
# Framing Clocks = ((2.0 nsec * 9) + (20 inches * 200 psec/in) + 4.5 nsec) / 12 nsec 
# Framing Clocks = (18 nsec + 4.0 nsec + 4.5 nsec) / 12 nsec = 26.5 nsec / 12 nsec 
# Framing Clocks = 2.2, rounded up to either 2.5 or 3. 
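
The four steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the worked example; the function and parameter names are hypothetical, and all times are carried in picoseconds so that the rounding is done with integers.

```python
def framing_clocks_before_data(cpu_cycle_ps, framing_period_ps,
                               max_distance_in, total_skew_ps):
    """Minimum Framing clocks between SysDc and the first data cycle,
    rounded up to the next half Framing clock."""
    fixed_delay_ps = 9 * cpu_cycle_ps            # FD: 9 processor cycles
    settle_time_ps = 2 * max_distance_in * 200   # ST: round trip at 200 psec/in
    total_ps = fixed_delay_ps + settle_time_ps + total_skew_ps
    # Count half Framing clocks using integer ceiling division.
    half_clocks = -(-(2 * total_ps) // framing_period_ps)
    return half_clocks / 2


# Values from the example: 2 nsec CPU cycle, 12 nsec Framing clock,
# 10 inches maximum distance, 4.5 nsec total skew.
print(framing_clocks_before_data(2000, 12000, 10, 4500))  # 2.5
```

A system that only starts transfers on whole Framing clocks would round the result up once more, to 3 in this example.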


The following timing diagram illustrates data movement into EV6 using the results of the sample 
calculation shown above. The bottom trace in the diagram illustrates the EV6 internal clock with text 
indicating the 9 fixed processor cycles preceded by 6 delay (w) cycles. Systems use the wait cycles to 
delay the perception of SysDc so that the first piece of data arrives in time to be sampled by EV6. There is 
a 4-bit CBOX IPR called sysdc_delay that is used to fine tune the interface timing in the manner shown 
below. The delay does not affect SysData bandwidth. 


[Timing diagram: Frame Clock, SysAddIn (SysDc sent off rise of Frame Clock), SysData (data sent off fall of Frame Clock, with transport delay of data), and EV6 Clock traces] 


If a fast transfer is interrupted and fails to complete, the system must use the conventional protocol by 
sending EV6 a SysDc command of WriteData to remove the desired data buffer. The following section 
describes the timing events for transferring data from EV6 to the system. 


3.3.9.2 Fast Data Disable Mode 


The system controls all data movement to and from EV6. Movement of data into and out of EV6 is 
preceded by a SysDc command. EV6 drivers are enabled only for the duration of an 8 cycle transfer of 
data from EV6 to the system. Systems must ensure there is no overlap of enabled drivers and that there is 
adequate settle time on the SysData bus. As described above, systems must ensure there is proper settle 
time when transitioning the SysData bus from write to read and from read to write. Settle time is 
measured from the point where EV6 sends the last quadword of data to the point when the system begins 
to transfer its first quadword (and vice versa). Settle time is an electrical constraint that can be calculated 
by multiplying the round trip distance of the furthest data chip from EV6 (in inches) by 200 psec/inch. 


The diagram below shows the transfer of data into EV6 on an idle SysData bus. 
[Timing diagram: Frame Clock, SysAdd/Cmd (command sent off Frame Clock), Cmd Receiver, SysData (data sent off fall of Frame Clock), and Data0 Receiver traces] 


When in Fast Data Disable mode, systems move data from EV6 with a WriteData SysDc command. This 
is used for removing data from a probe buffer, a victim buffer or a WRIO buffer. There is a fixed timing 
relationship from the point where EV6 receives the SysDc command until it drives the first QW on the 
SysData bus. Seven cycles after receiving the SysDc, EV6 looks for the rising edge of the next Framing 
clock. The example below shows the SysDc sent from the system on the rising edge of the first Framing 
clock with EV6 driving data two Framing clocks later. This delay is fixed and uninterruptible, much like 
reading a RAM. Note there is some skew between the EV6 Frame clock and the System Frame clock. 


3.3.9.4 SysDataIn/OutValid 


There are two signals that are sourced by the system that control the rate of data delivery to and from 
EV6. These signals are associated with the address/cmd and have data bus timing attributes. Each signal 
represents a 64 bit quantity of data. For a complete transfer of data, EV6 must see the DataValid signals 
asserted for 8 data cycles. There can be any number of leading zero (deasserted) cycles and any pattern of 
gaps between valid cycles. Once the 8th cycle of an asserted DataValid signal is perceived by EV6, the 
transfer is considered complete. Minimal latency is achieved when the SysDataIn/OutValid signal is 
asserted in the same cycle as the SysDc command. Both SysDataIn/OutValid are ‘don’t cares’ when not 
accompanied by a SysDc command. Systems may elect to drive and receive data at the lowest latency and 
highest bandwidth by asserting both SysDataIn/OutValid continuously. 
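
As a rough illustration of the counting rule above, the following sketch finds the data cycle in which a transfer completes, given the sampled DataValid values. It is a behavioral model only; the function name and the list-of-samples representation are assumptions made for this example.

```python
def transfer_complete_cycle(data_valid, cycles_needed=8):
    """Return the index of the data cycle in which the 8th asserted
    DataValid sample is seen, or None if the transfer is incomplete."""
    seen = 0
    for cycle, valid in enumerate(data_valid):
        if valid:
            seen += 1
            if seen == cycles_needed:
                return cycle
    return None


# Continuously asserted: the transfer completes in the 8th cycle (index 7).
print(transfer_complete_cycle([1] * 8))                          # 7
# One leading deasserted cycle and one gap push completion out by two cycles.
print(transfer_complete_cycle([0, 1, 1, 0, 1, 1, 1, 1, 1, 1]))  # 9
```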


EV6 expects to clock a valid data word on the 9th CPU clock after clocking the associated SysDataValidIn 
signal. Systems must ensure that data does not arrive too early. The following diagram illustrates a 
system transfer of a block of data into EV6. 


[Timing diagram: Frame Clock, SysAddIn (SysDc), SysDataInValid, and SysData (D0 through D5 shown) traces] 


The timing relationship is slightly different for transfers out of EV6. EV6 will drive the first piece of 
data on the rise of the ‘Framing’ clock 7 CPU cycles after perceiving the first SysDataValidOut signal. 
From that point forward, every cycle of deasserted DataValidOut will cause a one cycle gap in the data 
transfer. 


SysDc commands that do not move data into EV6 but modify cache tag state must allow a 2 data cycle 
window in the data bus by asserting the SysDataValidIn signal for 2 clock forwarded cycles. These 
commands are the success acknowledgements for ChangeToDirty and InvalToDirty commands. Systems that elect 
to tie this signal high (always asserted) must allow for a two cycle gap in the SysData bus when doing 
these commands. 


Systems that elect to control the rate at which EV6 delivers data to the system must set the 
add_frame_select register to 0. This means that all transmissions from EV6 to the system (address and 
data) will ignore framing clock edges and will commence on the earliest SysAddClkOut or 
SysDataClkOut. 


3.3.9.5 SysFillValid 


SysFillValid, when asserted, validates the current memory and I/O data transfer into EV6. Systems may 
elect to tie this pin to a logical 1 to always assert valid fills or use it dynamically to enable or cancel fills 
as they progress. The net effect of this signal is to allow MP systems some additional time to attain probe 
results. EV6 will sample the value of SysFillValid at D1 time (the point at which EV6 samples the second 
data cycle). If SysFillValid is asserted at D1 time, the fill will continue uninterrupted. If it is 
unasserted, EV6 will cancel the fill but maintain the valid MAF entry until a successful fill occurs. The 
timing diagram below illustrates SysFillValid. 


[Timing diagram: SysAddIn (SysDc), Cmd Receiver (transport delay on address), SysFillValid, and SysData traces] 





3.3.9.6 Data Wrapping 


All data movement between EV6 and the system is of one size only: 64 bytes, or 8 complete cycles 
on the data bus. EV6 will generate memory read and write addresses that point to the desired octaword. 
All 64 bytes of memory data are valid. This applies to memory reads, memory writes and system probe 
reads. 


I/O read and write addresses on the SysAddOut bus will point to the desired Byte, Word, Longword or 
Quadword, with a combination of address bits <5:3> and the mask field <7:0>. That combination is defined 
as follows: 


| Command                     | Significant Address Bits | Mask Type | Rules |
| RdQWs and WrQWs             | SysAddOut<5:3>           | QW        | Bits <5:3> will contain the exact PA bits of the first LDQ or STQ to the block. The Mask bits point to the valid QWs merged in ascending order. |
| RdLWs and WrLWs             | SysAddOut<5:3>           | LW        | Bits <5:3> will contain the exact PA bits of the first LDL or STL to the block. The Mask bits point to the valid LWs merged in ascending order within one hexword. |
| LDByte/Word and STByte/Word | SysAddOut<5:3>           | BYTE      | Bits <5:3> will contain the exact PA bits of the LDByte/Word or STByte/Word. One byte mask for byte operation and two for word. No merging. |








The order in which data is given to EV6 (in the case of a memory or I/O fill) or moved from EV6 (write 
victims or probe reads) is determined by the system. Systems can choose to reflect back the same 
low-order address bits and the corresponding octaword, or any other starting point within the block. 


SysDc commands for ReadData, ReadDataShared and WriteData require that systems define the position 
of the 1st QW by inserting the appropriate SysAddOut<5:4> into bits <1:0> of the command field. The 
recommended starting point is the quadword pointed to by EV6; however, some systems may find it more 
beneficial to begin the transfer elsewhere. The key point is that EV6 must always be told what the starting 
point is, and the wrap order for all subsequent quadwords is always interleaved. The following table 
defines the method for systems to specify wrap and deliver data: 


| Source/Dest | SysDc<4:2>                 | SysDc<1:0>     | Size             | Rules      |
| Memory      | ? (ReadDataShared)         | SysAddOut<5:4> | Block (64 Bytes) | See Note 1 |
|             | 111 (ReadDataShared/Dirty) | SysAddOut<5:4> | Block (64 Bytes) | See Note 1 |
|             | 010 (WriteData)            | SysAddOut<5:4> | Block (64 Bytes) |            |
| I/O         | 100 (ReadData)             | SysAddOut<5:4> | QW (8-64 Bytes)  |            |
|             | 010 (WriteData)            | ?              | LW               | ?          |
|             | ?                          | ?              | Byte/Word        | ?          |





NOTE 1 Transfers to and from EV6 are 8 data cycles for 8 total QWs. The starting point is defined by the 
system. The preferred starting point is the one pointed to by SysAddOut<5:4>. Systems can insert the bits 
<5:4> into bits <1:0> of the SysDc command. The wrap order is ‘interleaved’ as defined by the table 
below. 

         PA Bits <5:3> of Transferred QW 
1st QW   000   010   100   110 
2nd QW   001   011   101   111 
3rd QW   010   000   110   100 
4th QW   011   001   111   101 
5th QW   100   110   000   010 
6th QW   101   111   001   011 
7th QW   110   100   010   000 
8th QW   111   101   011   001 

NOTE 2 Longword and Byte/Word reads differ from all other transfers. Systems unload only 4 QWs of 
data into 8 data cycles by sending each QW twice. The first QW returned is determined by 
SysAddOut bits <4:3>. Systems again may elect to choose their own starting point for the 
transfer and insert that value into SysDc<1:0>. The wrap order for “double pumped” transfers is 
interleaved as defined by the table below. 

         PA Bits <5:3> of Transferred QW 
1st QW   x00   x01   x10   x11 
2nd QW   x00   x01   x10   x11 
3rd QW   x01   x00   x11   x10 
4th QW   x01   x00   x11   x10 
5th QW   x10   x11   x00   x01 
6th QW   x10   x11   x00   x01 
7th QW   x11   x10   x01   x00 
8th QW   x11   x10   x01   x00 
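
Both wrap tables are consistent with a simple rule: the address of each transferred quadword is the starting address XORed with a running index. That rule is an observation drawn from the tables rather than a statement made elsewhere in this section, so the sketch below should be read as a convenient way to regenerate the tables, with hypothetical function names.

```python
def interleaved_wrap_order(start_qw):
    """PA<5:3> of each of the 8 quadwords of a block transfer, in order.
    start_qw is the PA<5:3> value of the first quadword."""
    return [start_qw ^ index for index in range(8)]


def double_pumped_wrap_order(start_bits):
    """PA<4:3> for each of the 8 data cycles of a double pumped transfer.
    Each of the 4 quadwords is sent twice, so the index advances every
    second data cycle."""
    return [start_bits ^ (cycle >> 1) for cycle in range(8)]


# Regenerate the columns of the two tables.
for start in (0b000, 0b010, 0b100, 0b110):
    print([format(qw, "03b") for qw in interleaved_wrap_order(start)])
for start in (0b00, 0b01, 0b10, 0b11):
    print(["x" + format(qw, "02b") for qw in double_pumped_wrap_order(start)])
```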


3.3.10 Data ECC 


EV6 supports a QW error correction code for the System Data bus. ECC is generated by the CPU for all 
Memory write transactions (WrVictimBlk) emitted from EV6 and for all probe data. ECC is also checked 
on every Memory read for single bit correction and double bit error detection. Bcache data is checked for 
fills to the Dcache and for all Bcache to system transfers (victims and probes). 


I/O write data will not have a valid ECC (the ECC bits must be ignored by the System) and, similarly, no 
checking is done on I/O read data. 


If the System indicates that Memory data should not be checked, via the ECC_DISABLE mode setting in 
the MBOX CSR, then no checking or correcting is performed. 


3.3.10.1 ECC Code 


                 11 1111 1111 2222 2222 2233 3333 3333 4444 4444 4455 5555 5555 6666 CCCC CCCC 
     0123 4567 8901 2345 6789 0123 4567 8901 2345 6789 0123 4567 8901 2345 6789 0123 0123 4567 

CB0  0111 0100 1101 0010 0111 0100 1101 0010 1000 1011 0010 1101 1000 1011 0010 1101 1000 0000 
CB1  1110 1010 1010 1000 1110 1010 1010 1000 1110 1010 1010 1000 1110 1010 1010 1000 0100 0000 
CB2  1001 1001 0110 0101 1001 1001 0110 0101 1001 1001 0110 0101 1001 1001 0110 0101 0010 0000 
CB3  1100 0111 0001 1100 1100 0111 0001 1100 1100 0111 0001 1100 1100 0111 0001 1100 0001 0000 
CB4  0011 1111 0000 0011 0011 1111 0000 0011 0011 1111 0000 0011 0011 1111 0000 0011 0000 1000 
CB5  0000 0000 1111 1111 0000 0000 1111 1111 0000 0000 1111 1111 0000 0000 1111 1111 0000 0100 
CB6  1111 1111 0000 0000 0000 0000 1111 1111 1111 1111 0000 0000 0000 0000 1111 1111 0000 0010 
CB7  1111 1111 0000 0000 0000 0000 1111 1111 0000 0000 1111 1111 1111 1111 0000 0000 0000 0001 
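
The code table can be read as a parity-check matrix: each CB row marks the data bits that contribute to that check bit. The sketch below illustrates how check bits and a syndrome could be computed from it. Two things are assumptions made for this example and are not stated in this section: that each check bit is the plain XOR (even parity) of the marked data bits, and that the leftmost column of the table is data bit 0 of the quadword. All names are hypothetical.

```python
# Data-bit columns of each check-bit row, transcribed from the code table.
# The leftmost character of each string corresponds to data bit 0.
ROWS = [
    "0111010011010010" "0111010011010010" "1000101100101101" "1000101100101101",  # CB0
    "1110101010101000" * 4,                                                          # CB1
    "1001100101100101" * 4,                                                          # CB2
    "1100011100011100" * 4,                                                          # CB3
    "0011111100000011" * 4,                                                          # CB4
    "0000000011111111" * 4,                                                          # CB5
    "1111111100000000" "0000000011111111" "1111111100000000" "0000000011111111",  # CB6
    "1111111100000000" "0000000011111111" "0000000011111111" "1111111100000000",  # CB7
]


def check_bits(data):
    """Return [CB0..CB7] for a 64-bit quadword (assumes even parity)."""
    result = []
    for row in ROWS:
        parity = 0
        for bit, marker in enumerate(row):
            if marker == "1":
                parity ^= (data >> bit) & 1
        result.append(parity)
    return result


def syndrome(data, stored_check_bits):
    """XOR of recomputed and stored check bits; all zeros means no error seen."""
    return [a ^ b for a, b in zip(check_bits(data), stored_check_bits)]


good = 0x0123456789ABCDEF
stored = check_bits(good)
corrupted = good ^ (1 << 0)          # single-bit error in data bit 0
print(syndrome(good, stored))        # [0, 0, 0, 0, 0, 0, 0, 0]
print(syndrome(corrupted, stored))   # [0, 1, 1, 1, 0, 0, 1, 1]
```

Under these assumptions, a single-bit data error produces a syndrome equal to that bit's column in the table, which is what allows the failing bit to be located and corrected.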


3.3.11 Ordering of System Port Transactions 


This section details transaction ordering issues as they relate to the System port. There are two classes of 
ordering considerations: 


• EV6 commands and System probes 
• System probes and SysDc transfers 


3.3.11.1 EV6 Commands and System Probes 


The issue to be addressed involves EV6-generated commands and System port probes which reference the 
same cache block. First, a few points: 


• EV6 commands reflect all probe responses sent and probe responses reflect all EV6 commands sent. 

• VAF (Victim Address File) and VDB (Victim Data Buffer) entries each have independent valid bits 
  for both a Victim and a Probe. 

• Probe results indicate a hit on a VAF/VDB and whether or not the address has been sent. Systems can 
  decide whether to move the buffer once or twice. 

• Probe responses are issued in the order that they were received; however, there is no requirement for 
  the system to retain order when issuing release buffer commands. 

• Probe invalidates that match a valid VAF for which the address has been sent will clear the VAF so 
  that subsequent probes to this same cache block will NOT report a Hit VDB condition. The RVB is 
  still required to release the VDB. 


The following table lists all interactions between pending internal EV6 commands and probe commands, 
and shows EV6’s response in each case. 






                 Probe: Next-State Command 
EV6 Command      Invalid       Clean     Clean/    Dirty/    Nop       Trans1    Trans2 
                                         Shared    Shared 
RdBlk 
RdBlkMod 
FetchBlk 
InvalToDirty 
WrVictimBlk      Table 1       Table 1   Table 1   Table 1   Table 1   Table 1   Table 1 
CleanToDirty     RdBlkMod      Table 2   Table 2   ?         Table 2   Table 2 
                 /fail STx_C 
SharedToDirty    RdBlkMod      Table 2 
                 /fail STx_C 

Notes: 


• RdBlkVic and RdBlkModVic do not appear in the above table. If the interaction is between the probe 
  and the victim, then it is the same as WrVictimBlk. 


• Probes that invalidate locked blocks will not result in a RdBlkMod command. EV6 must fail the 
  STx_C as defined in the Alpha SRM. 


• All reads (RdBlk, RdBlkMod, Fetch, InvalToDirty) have no interaction, as EV6 does not yet own the 
  block. 


Legend for Table 1 and Table 2 is as follows: 


DM = data movement (0 := probe does not need data, 1 := System requires data) 
NsI = Next State Invalid (0 := next state NOP or shared etc., 1 := Next State is Invalid) 
AS = Address has been sent to the system 
NOP = no buffer for probe 
Status<1:0> = EV6 response on Probe (00 = HitClean, 01 = Hit/Shared, 10 = HitDirty, 11 = Dirty/Shared) 
Type = Type of Hit (MAF = MAF hit and Address sent, VDB = VDB hit and Address sent) 
sendV = send victim to System as usual 
killV = block is no longer considered a victim by EV6 
RVB = release the Victim Valid bit on the VDB 
RPB = release the Probe Valid bit on the VDB 
moveV = move Victim data from EV6 to the System 
moveP = move Probe data from EV6 to the System 
suppress victim = don’t moveV or (write victim to Memory and guarantee System DMA write is last) 
cam = Content Addressable Memory: a queue of addresses that are bit-for-bit compared with new entries 
Table 1: Probes that interact with WrVictimBlk 

| DM | NsI | AS | Status<1:0> | Type | EV6 Action                   | System Action                      |
| 0  | 0   | 0  | NOP         | NOP  | SendV; wait RVB              | moveV, RVB                         |
| 0  | 0   | 1  |             |      | wait RVB                     | moveV, RVB                         |
| 0  | 1   | 0  | NOP         | NOP  | KillV                        | NOP                                |
| 0  | 1   | 1  |             |      | wait RVB                     | RVB, suppress victim (VDB#cam)     |
| 1  | 0   | 0  | HitDirty    | NOP  | SetP, SendV; wait RPB, RVB   | moveV/P, RPB, RVB                  |
| 1  | 0   | 1  | HitDirty    | VDB  | SetP; wait RPB, RVB          | moveV/P, RPB, RVB                  |
| 1  | 1   | 0  | HitDirty    | NOP  | killV, SetP; wait RPB        | moveP, RPB                         |
| 1  | 1   | 1  | HitDirty    | VDB  | SetP; wait RPB, RVB          | moveP, RPB, RVB, suppress victim   |


Table 1 Notes: 


1) VAF state: SendV, Vvalid, Pvalid 
2) moveV/P depends on Systems; blocks could be moved once or twice 
3) Systems with address cams may clear both bits at the same time. 


System Notes: 


1) Tagless Uniprocessor - using P and V independently requires no address cams; the System can either 
   compare VDB#s or observe the rule that Memory is written with the WrVictimBlk first, followed by 
   the DMA write. 
2) Tagless MP - address cams are needed to fail SharedToDirty commands. 
3) Tagged MP - 
   If ChangeToDirty commands are failed by probing the duplicate tag, no address cams are needed. 
   If the VDB is not released until the Victim address is on the Bus, no address cams are needed for 
   new probes versus victims. 


Table 2 illustrates the action taken by the System and EV6 when a probe interacts with a ChangeToDirty 
command. A ChangeToDirty (XtoD) can be either a CleanToDirty (CtoD) or a SharedToDirty (StoD). 


Table 2: Probes that interact with ChangeToDirty 


NsI  AS  Status<1:0>  Type  EV6 Action / System Action 
0    0   NOP          NOP   Send StoD if Next state = S 
0    1   NOP          MAF   If CtoD, System can succeed or fail 
1    0   NOP          NOP   Fail XtoD 
1    1   NOP          MAF   Fail XtoD 


Table 3 illustrates the actions taken when a probe has a conflict with a pending fill. 


PFAS = Probe is first to arrive at serialization point 
PFT6 = Probe is sent first to EV6 


Table 3: Probes that interact with pending Memory refills 


PFAS  PFT6  ACTIONS 
0     0     Send probe after fill. 
0     1     Tagged System option; read probes that hit in the MAF wait for fill to complete. 
1     0     N/A 
1     1     Normal case; System waits for probe response and then sends SysDc fill command. 


Note: Probe commands that contain a SysDc fill to the same address are considered unordered with 
respect to the action they take on the cache. The fill may occur before the probe or the probe may occur 
before the fill. 


3.3.12 System Port Clocking 


This chapter will define all aspects of clocking the EV6 processor. It will define the rules for input clock 
frequency, initialization and reset rules, system port clocking rules, and finally rules for entering and 
exiting low power. 


3.3.12.1 Input Oscillator 


EV6 has a nominal internal “CPU” clock rate of 500 MHz. It is produced as the result of a phase locked 
loop circuit with a frequency multiplying VCO generating 800 MHz to 1.2 GHz that is divided by 2 
(nominally) and distributed throughout EV6 (GCLK). Systems provide an input frequency on 
CLKIN_H/L that is used by the PLL for phase alignment. CLKIN_H/L can range from 80 to 200 MHz. 
Systems will input a differential sinusoidal signal, preferably from a PLL that is also the source of the 
clock for the system interface logic. The electrical, jitter and phase alignment specifications for 
CLKIN_H/L are described in the PLL section of the Electrical data chapter. 


There are three divisor circuits in the PLL loop. The X and Z divisors shown in the diagram below are 
controlled by an internal clock controller that steps up the frequency of the chip during power on/reset and 
also steps the frequency down and up for sleep mode. The Y divisor is set during reset by copying the 
values on IRQ<2:0> into a clock IPR. The Y divisor is never modified. Systems use the table below to 
select the appropriate Y divisor to establish the desired EV6 frequency. For example, if a system supplies 
a 100 MHz CLKIN and wants to run EV6 at 500 MHz, it must establish a Y divisor of 5. 
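
The relationship in the example (GCLK = CLKIN * Y, with CLKIN limited to 80 to 200 MHz) can be sketched as follows. The candidate divisor range of 3 through 8 is an assumption read from the column values of the divisor table, and the function name is hypothetical; the encoding of Y on IRQ<2:0> is not modeled.

```python
def clkin_options_mhz(gclk_mhz, divisors=range(3, 9),
                      clkin_min_mhz=80.0, clkin_max_mhz=200.0):
    """Map each usable Y divisor to the CLKIN frequency it requires.
    Divisors whose CLKIN falls outside the allowed range are omitted."""
    options = {}
    for y in divisors:
        clkin = gclk_mhz / y
        if clkin_min_mhz <= clkin <= clkin_max_mhz:
            options[y] = round(clkin, 3)
    return options


# A 500 MHz GCLK can be reached with a 100 MHz CLKIN and Y = 5.
print(clkin_options_mhz(500))
# {3: 166.667, 4: 125.0, 5: 100.0, 6: 83.333}
```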






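CLKIN frequency (MHz) required for each Y divisor (na = outside the CLKIN range; ? = not legible) 

| GCLK period (nsec) | GCLK freq (MHz) | Y=3     | Y=4     | Y=5     | Y=6     | Y=7    | Y=8  |
| ?                  | 400             | 133.33  | 100     | 80      | na      |        |      |
| 2.4                | 416.66          | 138.889 | 104.167 | 83.333  | na      |        |      |
| 2.3                | 434.8           | ?       | 108.696 | 86.956  | na      |        |      |
| ?                  | ?               | ?       | 113.636 | 90.91   | na      |        |      |
| ?                  | ?               | 158.73  | ?       | 95.238  | ?       |        |      |
| 1.8                | 555.555         | 185.185 | 138.889 | 111.111 | 92.593  | na     |      |
| 1.666              | 600             | 200     | 150     | 120     | 100     | 85.72  |      |
| 1.5                | 666.667         | na      | 166.667 | 133.333 | 111.11  | 95.238 | 83.3 |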


Note that the lowest frequency applied to the input of the PLL in normal operating mode is 80 MHz. 


3.3.12.2 System Clock or Framing Clock 


Systems are expected to run at an integer divisor of the oscillator input clock. EV6 requires a skew 
controlled copy of the system clock as a reference, or Framing Clock. This clock is a single-ended 50% 
duty cycle clock. This clock is captured by EV6 at the deassertion of reset. The captured framing clock 
will track the internal GCLK. EV6 uses this clock to determine the start of the system clock cycle for both 
clock forwarded transfers. Additionally, EV6 uses the Framing clock to do a synchronous reset of the 
Clock Forwarding circuit. Systems must generate a Framing clock with a period that can ensure 
proper synchronous transfer of the clock forward reset. The following block diagram illustrates a 
representative clock distribution scheme for EV6 systems. 


3.3.12.3 Clock Forwarding Definition 


Clock forwarding is a well known communication technique that allows for transfers at higher speed than 
traditional synchronous point-to-point communication would allow. EV6 has very high address and data 
bandwidth requirements coupled with limited signal pins. Clock forwarding is the method that can 
overcome the limited pin availability and yet provide high bandwidth. Previous Alpha processors 
communicated to interface designs via a skew controlled synchronous clock. EV6 will send and receive 
data and address/cmd to/from systems accompanied by a single ended clock. There is one single ended 
clock for the address out (including SysFillValid and DataValid In/Out) and one single ended clock for 
the address in. Further, each byte of the data bus has a single ended clock for EV6 to System transfers 
and for System to EV6 transfers. 


On the receiving end of the clock forwarded path, circuitry is sensitive to the forwarded clock such that it 
is used to strobe the flop of the data that it accompanies. Additionally, the receive circuit has a counter 
that enables the receive flop, and this counter is incremented by the received clock. What follows is a 
simple circuit illustrating a receive circuit for a single bit. 


[Figure: receive circuit for a single bit. Labels: FORWARD DATA IN, FORWARD CLOCK IN, TARGET flop (D, Q), GCLK, FORWARD CLOCK OUT (1.5, 2, 2.5, 3, 3.5, 4), IPR Preset<1:0>, Clk Fwd Reset] 


A simplified example of a sending circuit is shown below. 


[Figure: sending circuit. Labels: GCLK, FORWARD DATA OUT, FORWARD CLOCK OUT, CLOCK GENERATOR, IPR CLOCK SEL (1.5, 2, 2.5, 3, 3.5, 4), FORWARD CLOCK] 


3.3.12.4 Glossary of Terms 
There are a number of terms that will be used repeatedly in this specification and require defining. 


BIT TIME - Specified in nsec; the total time that a signal conveys a single valid piece of 
 information. All data and commands are associated with a clock, and the receivers latch on both 
 the rise and fall of the clock. Bit times are defined as a multiple of the EV6 clocks. Systems must 
 produce a Bit time identical to EV6. 

FORWARD CLOCK - A single ended signal that is aligned with its associated fields. Sourced and 
 aligned by the sender with a period that is 2 times the bit time. Forwarded clocks must be 50-50 duty 
 cycle clocks whose rising and falling edges are aligned with the changing edge of the data. 

FRAMING CLOCK - The framing clock defines the start of a transmission either from the system to EV6 
 or from EV6 to the system. The Framing clock is a power-of-2 integer multiple of the EV6 CPU 
 clock and is usually the system clock. The Framing clock and the input oscillator can have the same 
 frequency. The add_frame_select IPR sets the ratio of bit times to Framing clock. The Frame clock 
 could have a period that is 4 times the bit time with an add_frame_select of 2X. Transfers begin on the 
 rising and falling edge of the Frame clock. This is useful for systems that have system clocks with a 
 period too small to perform the synchronous reset of the clock forward logic. 

SYSTEM CLOCK - The primary skew controlled clock used throughout the interface components to 
 clock transfers between ASICs, main memory and I/O bridges. 

GCLK - Global clock within EV6; the 2 nsec globally distributed clock within EV6. 

INTERFACE RESET - A synchronously received reset signal that is used to preset and start the clock 
 forwarding circuitry. During this reset, all forwarded clocks are stopped and the presettable count 
 values are applied to the counters; then, some number of cycles later, the clocks are enabled and are 
 free running. 

RECEIVE COUNTER - Counter used to enable the receive flops. It is clocked by the incoming forwarded 
 clock and reset by the Interface Reset. 




RECEIVE MUX COUNTER- The Receive Mux counter is preset to a selectable starting point and 
incremented by the locally generated Forward Clock. 

OUTPUT MUX COUNTER - Counter used to select the output mux that drives address and data. It is 
reset with the Interface Reset and incremented by a copy of the locally generated forwarded clock. 

CORRELATED SKEW - Uncertainty contributors that track in common. An example of correlated 
 skew might be signals sourced from the same chip and sent to the same destination chip. The total 
 system clock skew is correlated among this group of signals. Intra-die process variations are also 
 correlated. 

UNCORRELATED SKEW - The mismatch between the delay of the forwarded clock and the forwarded 
 data. There are a number of contributors in the uncorrelated category whose total magnitude is a 
 limit to the minimum bit time. Uncorrelated skew is what forces the forwarded clock out of 
 alignment with respect to the data. 

TARGET CLOCK - Skew controlled clock which receives the output of the RECEIVE MUX. 

CLOCK OFFSET - or ClkOffset, is the delay intentionally added to the forwarded clock to meet the setup 
 and hold requirements at the Receive Flop. 


3.3.12.5 Clock Forwarding Bit times 


EV6 will derive its forwarded clock from the internal CPU clock, otherwise known as GCLK. Systems can 
choose from one of six GCLK multiples for the forwarded clock. Those values are 1.5, 2, 2.5, 3, 3.5 and 4. 


Systems must match their send and receive circuits with the BIT TIME that they select for EV6. If EV6 
is set up to drive data at a 3 nsec BIT TIME, then the system must send and receive at the same 3 nsec 
rate. Below is a table that shows the bit times for all possible clock multiples and the standard GCLK 
frequencies. 


EV6 FORWARD          EV6 Internal Operating Frequency 
CLOCK MULTIPLIER     450MHz      500MHz    550MHz       600MHz       700MHz 
1.5                  3.3 nsec    3 nsec    2.75 nsec    2.5 nsec     2.14 nsec 
2                    4.4 nsec    4 nsec    3.63 nsec    3.33 nsec    2.84 nsec 
2.5                  5.5 nsec    5 nsec    4.545 nsec   4.16 nsec    3.5 nsec 
3                    6.6 nsec    6 nsec    5.454 nsec   5 nsec       4.26 nsec 
3.5                  7.7 nsec    7 nsec    6.363 nsec   5.833 nsec   4.97 nsec 
4.0                  8.8 nsec    8 nsec    7.272 nsec   6.667 nsec   5.68 nsec 
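
Each entry in the table is the forward clock multiplier times the GCLK period. A minimal sketch of that relationship follows; the function name is hypothetical, and results may differ from the table in the last digit because the table appears to use truncated GCLK periods.

```python
def bit_time_ns(multiplier, gclk_mhz):
    """Bit time in nanoseconds: forward clock multiplier times the GCLK period."""
    return multiplier * 1000.0 / gclk_mhz


# The 500 MHz column of the table: 3, 4, 5, 6, 7 and 8 nsec.
for multiplier in (1.5, 2, 2.5, 3, 3.5, 4):
    print(multiplier, bit_time_ns(multiplier, 500))
```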


Below is a timing diagram showing an example of the timing relationship of the three clocks (Framing 
clock, forwarding clock and GCLK) and the data and address bus. The frequency of the clock matches 
that of the data. Receive circuits must be designed to latch on both the rise and fall of the forwarded 
clock. Note that it does not illustrate correct protocol. 


3.3.12.6 Principles Of Operation 


A single-ended clock accompanies at most 16 signals from a sender targeted at a receiver. All delay 
contributors that affect the propagation of the clock and the signals it accompanies are matched as closely as 
possible. The output stage and the drivers are closely matched to align the rise/fall of the single-ended 
clock with the front edge of the data signal. Systems must use a similarly matched circuit design for their 
sending circuits. 


3.3.12.6.1 Number of RECEIVE FLOPS required 


Receiving circuits for the system designs have one variable: the number of receive flops, along 
with the associated size of the receive clock enable counter. The absolute minimum number of receive 
flops (N) is the Target Clock Period / BIT TIME, and this would assume that there is no skew, no setup 
and no hold. The minimum number of receive flops (N) can be determined by the following equation: 


N = (Target Clock Period + (Max_delay - Min_delay) + TSkew + Tsetup + Thold) / BIT TIME 


ex. Target Clock Period (System clock) = 12 nsec 
    Max_delay (total worst case delay from sender to Target flop) = 9 nsec 
    Min_delay (total best case delay from sender to Target flop) = 5 nsec 
    TSkew (total clock uncertainty between GCLK and System Clock) = 5 nsec 
    Tsetup (setup time of the Target flop) = 100 psec 
    Thold (hold time of the Target flop) = 500 psec 
    BIT TIME = 3 nsec 


N = (12 + (9 - 5) + 5 + 0.1 + 0.5) / 3 = 21.6 / 3 = 7.2, rounded up to the next integer = 8 receive flops. 


3.3.12.6.2 Maximum allowable skew and max to min delay difference 


EV6 will have 4 receive flops for both the SysAddIn bus and the SysDataIn bus. This number of receive 
flops combined with the bit time will determine the maximum allowable difference between the 
Min_delay and the Max_delay plus the total clock skew. The equation is as follows: 

4 * BIT TIME > (EV6 CYCLE TIME + (Max_delay - Min_delay) + TSkew + Tsetup + Thold) 

BIT TIME = 3 nsec 
EV6 Cycle Time = 2 nsec 
Max_delay = 9 nsec 
Min_delay = 5 nsec 
TSkew = 4 nsec 
Tsetup = 0.100 nsec 
Thold = 0 nsec 

4 * 3 > (2 + (9 - 5) + 4 + 0.1) ..... 12 nsec > 10.1 nsec 
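
The two calculations above can be checked with a short sketch. The function names are hypothetical, and times are carried in picoseconds so that the rounding and the comparison use integer arithmetic.

```python
def min_receive_flops(target_period_ps, max_delay_ps, min_delay_ps,
                      tskew_ps, tsetup_ps, thold_ps, bit_time_ps):
    """Minimum number of receive flops (N) for a system receive circuit."""
    total_ps = (target_period_ps + (max_delay_ps - min_delay_ps)
                + tskew_ps + tsetup_ps + thold_ps)
    return -(-total_ps // bit_time_ps)   # integer ceiling division


def ev6_receive_margin_ps(bit_time_ps, cycle_time_ps, max_delay_ps,
                          min_delay_ps, tskew_ps, tsetup_ps, thold_ps,
                          receive_flops=4):
    """Slack between the window offered by EV6's receive flops and the
    window the delays require; a positive result satisfies the inequality."""
    window_ps = receive_flops * bit_time_ps
    required_ps = (cycle_time_ps + (max_delay_ps - min_delay_ps)
                   + tskew_ps + tsetup_ps + thold_ps)
    return window_ps - required_ps


# System receive circuit example: 21.6 nsec / 3 nsec = 7.2, rounded up.
print(min_receive_flops(12000, 9000, 5000, 5000, 100, 500, 3000))    # 8
# EV6 receive example: 12 nsec window against 10.1 nsec required.
print(ev6_receive_margin_ps(3000, 2000, 9000, 5000, 4000, 100, 0))   # 1900
```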


3.3.12.6.3 Receive Mux Counter Preset value 


Another system selectable value is the preset value on the receive mux counter. The receive mux counter
is incremented by a copy of the local Forward Clock so that it has a frequency equal to the bit rate and is
skew aligned with GCLK. Since the counter must select a receive flop at the earliest point in its valid
window, it is really determined by the maximum delay from the source clock within the sender's ASIC to
the output of the receive flop. The MAX DELAY between the sender and receiver can be greater than
the turnover of the receive mux counter. In fact, it can be many times greater than the time taken to
complete one cycle of the receive mux counter as long as all of the other requirements are met. The
preset delay in nanoseconds is:


MAX_DELAY (nsec) mod (Bit Time * 4) = Preset Delay (nsec)


Then the preset value which is loaded into the counter during clock forward reset is the two’s complement
of: PRESET DELAY (nsec) / BIT TIME (nsec)

Systems will have differing delays between the SysAddIn/Out buses and the SysData bus. Since EV6 has 4
receive flops for address and data, a wide valid window is created. This wide window can absorb delay
differences due to placement up to 2 nsec or about 13 inches. Therefore, when systems calculate the
receive mux counter preset value they should use the max delay of all the data and address buses.
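
A minimal sketch of the preset calculation follows, using integer picoseconds to avoid floating-point remainders. The function name is hypothetical. The text does not say how a preset delay that is not a whole number of bit times should be rounded, so truncation is used here purely as an assumption.

```python
def receive_mux_preset(max_delay_ps, bit_time_ps, flops=4):
    """Preset value for the receive mux counter (times in picoseconds)."""
    # One full turn of the counter covers `flops` bit times.
    counter_period_ps = bit_time_ps * flops
    preset_delay_ps = max_delay_ps % counter_period_ps
    # ASSUMPTION: partial bit times are truncated; the source does not say.
    steps = preset_delay_ps // bit_time_ps
    # Two's complement within the counter width (2 bits for 4 flops).
    return (-steps) % flops


# Max delay 9 nsec, bit time 3 nsec: preset delay 9 nsec, 3 bit times.
print(format(receive_mux_preset(9000, 3000), "02b"))
```

The worst-case max delay across all data and address buses would be passed as `max_delay_ps`, in line with the recommendation above.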


3.3.12.6.4 Minimum Bit Time


The minimum bit time, or the period of the forwarded data and clock (recall that the clock switches
at the same rate as the data), is vital in arriving at the maximum supportable
bandwidth of the interface. The limits are local, meaning within the specific set of signals and the
associated clock. For EV6, minimum bit time is established by examining the min/max differences
across a group of signals that have a common source and destination. For example, the SysAddOut bus is
accompanied by a forwarded clock and collectively they are targeted at one destination. The following
diagram illustrates the contributors that determine the minimum bit time.

From the diagram, one can derive the following equations:


a) Min_BitRate = FwdClkPeriod


b) Min_BitRate = UncorrelatedSkew + FwdDestSetup + FwdDestHold


c) UncorrelatedSkew = (FwdDataMax - FwdDataMin) + (FwdClkMax - FwdClkMin) +
                      (FwdClkOffsetMax - FwdClkOffsetMin)


There are a number of contributors to uncorrelated skew. They are as follows:


• Delay variations dependent on previous history of data transitions on an individual bit line.

• Simultaneous switching of outputs causing clock and data pad cells to experience delays that depend
  on the switching patterns of nearby neighbors.

• Crosstalk between signal lines coupling into adjacent signal nets, causing signals to move up and down
  from their crosstalk-free positions.

• Differences between the desired length of the clock lines and the data lines on both the module and
  package.

• Differences in impedance along the path of clock and data lines.

• Differences in the propagation velocity of clock and data due to etch runs on different signal layers.

• Differences in propagation delays of clock and data due to different cell types used in the two paths on
  both the source and target chip.

• Differences in termination techniques between the clock and data.

• Differences in loading of the clock and data networks at either the source or target chips.

• Differences in clocking times of different cells due to RC delays.

• Intra-die process variations.


To determine how much one must delay the forwarded clock relative to the forwarded data (FwdClkOffset),
use the following equations:

d) FwdClkOffset > (FwdDataMax + FwdDestSetup - FwdClkMin)


e) FwdClkOffset < (FwdDataMin + FwdDataPeriod - FwdClkMax - FwdDestHold)
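
Equations b) through e) can be evaluated together as in the sketch below. The function names and the sample delay figures are hypothetical, chosen only to exercise the formulas; they are not EV6 characterization data. All values are integer picoseconds.

```python
def uncorrelated_skew(data_max, data_min, clk_max, clk_min, off_max, off_min):
    """Equation c): sum of the min/max spreads of data, clock and offset."""
    return (data_max - data_min) + (clk_max - clk_min) + (off_max - off_min)


def min_bit_time(skew, dest_setup, dest_hold):
    """Equation b): smallest bit time the destination can tolerate."""
    return skew + dest_setup + dest_hold


def clk_offset_window(data_max, data_min, clk_max, clk_min,
                      dest_setup, dest_hold, data_period):
    """Equations d) and e): exclusive (lower, upper) bounds on FwdClkOffset."""
    lower = data_max + dest_setup - clk_min
    upper = data_min + data_period - clk_max - dest_hold
    return lower, upper


# Hypothetical example values in picoseconds.
skew = uncorrelated_skew(1200, 900, 1100, 1000, 550, 450)
print(skew, min_bit_time(skew, 100, 100))
print(clk_offset_window(1200, 900, 1100, 1000, 100, 100, 3000))
```

A usable offset exists only when the lower bound is strictly below the upper bound; if it is not, the bit time is too short for the given spreads.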


3.3.12.7 Power Down Mode 


EV6 is designed to operate in computer systems that meet all the criteria specified in the EPA Energy
Star worksheet. EV6 can automatically enter a low-power or “sleep” mode that enables the system to
consume 30 watts or less. EV6 will automatically “wake up” upon resumption of system activity and return
to the same state that existed prior to entering sleep mode.


1. EV6 will enter sleep mode by way of CALL_PAL WTINT. The following sequence of events will
   occur during the execution of this instruction.

2. EV6 writes a value into the TBD CSR external to the system which is the number of interval timer
   interrupts that the system ignores until the threshold is reached.

3. EV6 internal caches are swept. Clean blocks are invalidated (Evict commands are issued on systems
   with Duplicate Tag stores) and dirty cache blocks are written back to main memory.

4. EV6 then saves all architectural and readable state that would be needed upon returning to the wake
   state.

5. EV6 sets an IDLE bit in the interface that alerts the system that the interface is inactive. The system
   will ‘ACK’ this command when all outstanding probes have been serviced. The system will send no more
   probes until the IDLE bit is clear.

6. The routine now writes to an internal IPR that sets the divisor on the PLL output so that the GCLK
   now runs at less than 1/10th the nominal clock rate.

7. Upon receiving either an interval timer interrupt or a device interrupt (if enabled), EV6 will do a
   limited chip reset. That is, it resets all but the configuration registers.

8. When nominal clock rate is achieved, EV6 receives a ClockFwdReset to reset its own clock forward
   circuitry as well as the system's.

9. EV6 will then read the external interval timer threshold register to determine the type of interrupt and
   to update the memory resident time-of-year clock. EV6 will also clear the IDLE bit in the system.

10. EV6 will now restore the processor state and return to normal operation.
3.3.12.8 PLL Bypass


EV6 testing requirements include the ability to bypass the internal PLL. An input pin known as
PLLBypass, when asserted, will apply the CLKIN_H/L frequency directly to the internal GCLK
distribution. For nominal 500 MHz operation a 500 MHz sinusoidal differential signal must be applied
to the CLKIN_H/L pins.


Additionally, EV6 can be operated in a system with PLLBypass asserted. Systems providing a frequency
directly to EV6 in bypass mode will either provide an external PLL that performs the phase alignment to
EV6CLK_H/L, or absorb the delay from CLKIN to EV6CLK as skew. The max
delay from CLKIN to EV6CLK is TBD.


3.3.12.9 INITIALIZATION 


This section will describe those features of the EV6 clock forwarding interface that are programmable. 


1. Address and Data bus bit time
   A 3 bit field in the CBOX IPR that defines the forwarding clock period of the SysData bus and the
   SysAddIn/Out Bus as a multiple of the EV6 GCLK. The table shows the clock multiple which is
   selected and the associated bit times for each of the 6 possible settings. The IPR is called
   sysclk_ratio.


   Sysclk_ratio<2:0>   Multiple   Bit Time (EV6 @ 2 nsec)
   001                 1.5        3 nsec
   010                 2.0        4 nsec
   011                 2.5        5 nsec
   100                 3.0        6 nsec
   101                 3.5        7 nsec
   110                 4.0        8 nsec
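
The sysclk_ratio encoding can be captured as a small lookup, sketched below. The dictionary and function names are assumptions for illustration. Encodings 000 and 111 are not listed in the table, so this sketch rejects them rather than guess at their meaning.

```python
# sysclk_ratio<2:0> encoding -> GCLK multiple, as listed in the table.
SYSCLK_MULTIPLE = {
    0b001: 1.5,
    0b010: 2.0,
    0b011: 2.5,
    0b100: 3.0,
    0b101: 3.5,
    0b110: 4.0,
}


def sys_bit_time(sysclk_ratio, gclk_period_ns=2.0):
    """Bit time in nsec for a given sysclk_ratio field value."""
    if sysclk_ratio not in SYSCLK_MULTIPLE:
        # 000 and 111 do not appear in the table.
        raise ValueError(f"unlisted sysclk_ratio encoding: {sysclk_ratio:03b}")
    return SYSCLK_MULTIPLE[sysclk_ratio] * gclk_period_ns


print(sys_bit_time(0b001))
```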


2. Address and Data Receive Mux Counter Preset
   A field that is the preset value of the receive mux counter after deassertion of the synchronous clock
   forward reset. The preset value is chosen by careful analysis of the Max_delay of the forwarded clock and
   the earliest possible point to select the appropriate receive flop. The CBOX IPR is called
   sys_rcv_cnt_preset<1:0>.


   sys_rcv_cnt_preset<1:0>   Counter Preset
   00                        00
   01                        01
   10                        10
   11                        11
3. Y Divisor value select field
   A three bit field used to select one of 8 Y divisor values in the PLL return loop. The Y divisor divides
   down the distributed GCLK from the nominal 500 MHz to match the CLKIN_H/L frequency. The Y
   divisor value is set during reset by copying the statically held levels on IRQ<2:0> into a clock control
   register.


   IRQ<2:0>   Y divisor value
   000
   001
   010
   011
   100
   101





4. Framing Clock Offset
   A 2 bit field that changes the position of the framing clock relative to the framing clock seen at the input
   pins. It allows systems to adjust the start of EV6 generated commands and data so that it is guaranteed to
   be valid at the earliest system clock edge. This will help in reducing latency. Each bit equals one forward
   clock period of adjustment earlier than the nominal frame clock. This CBOX IPR is called
   fram_clk_offset<1:0>.


   fram_clk_offset<1:0>   # Forward Clock Periods
   00                     0 (nominal)
   01                     1
   10                     2
   11                     3


3.3.12.10 Clock Forward Reset 


Systems are required to generate Clock Forward Reset (ClkFwdReset_H). This signal must occur no
earlier than TBD cycles after the deassertion of reset_L and TBD cycles after the interval timer interrupt
is sent to wake up a powered down EV6. ClkFwdReset is a synchronous signal and is clocked into a
register in EV6 with the captured copy of Framing Clock. Systems must ensure that, with +/-2.0 nsec of
skew and a setup time of *** psec, EV6 can safely capture the assertion of ClkFwdReset.


There is one (framing clock) cycle of internal distribution delay on ClkFwdReset so that on the second
rising edge of ClkFwdReset, it is applied to the target circuit. The forwarded clocks are disabled both
at the system and within EV6. The receive counter is set to 0 and the sys_rcv_mux_cnt_preset<1:0> is
applied to the Unload counter in EV6. ClkFwdReset should assert for a minimum of 3 framing clock
cycles. The synchronous deassertion of ClkFwdReset will start the forwarded clocks. Clocks remain
on and free running.

The diagram below shows the application of ClkFwdReset. Note that it is asserted for only two framing
clock periods and the minimum is three. 


3.4 Bcache Port


EV6 supports a second level cache from 1 to 16 MB in size, with 64-byte blocks. A 128-bit bus is used for
data transfers between EV6 and the Bcache. The Bcache is fully synchronous, and the SRAMs must
contain either one, two or three internal registers. All Bcache control and address pins are clocked
synchronously on Bcache cycle boundaries. The Bcache clock period can vary from 1.5 to 4 CPU clock
cycles, in half cycle increments.


3.4.1 Bcache Port Pins


Pin Name               Type   Count   Description
BcAddress<23:4>_H      O      20      Bcache Index
BcDataOE<1:0>_H        O      1       Bcache data output enable
BcBurst_H              O      1       Bcache burst enable for burst mode SRAMs
BcDataWR_H             O      1       Bcache data write enable
BcData<127:0>_H        B      128     Bcache data
BcCheck<15:0>          B      16      ECC check bits for BcData
BcDataClkIn<7:0>_H     I      8       optional Bcache data input clocks
BcDataClkIn<7:0>_L     I      8       optional Bcache data input clocks
BcDataClkOut<3:0>_H    O      4       Bcache data clock outputs
BcDataClkOut<3:0>_L    O      4       Bcache data clock outputs
BcTag<42:20>_H         B      23
BcTagValid_H           B      1
BcTagDirty_H           B      1
BcTagShared_H          B      1
BcTagParity_H          B      1
BcTagClkOut_H/L        O      2
BcTagClkIn_H/L         I      2       optional BcTag input clocks
BcTagOE_H              O      1       tag ram output enable
BcTagWR_H              O      1       tag ram write enable


3.4.2 Pin Descriptions


3.4.2.1 BcAddress


BcAddress is a high drive output and supplies the index for the Bcache. EV6 supports the following
Bcache sizes: 0, 1 MB, 2 MB, 4 MB, 8 MB, and 16 MB.
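
To illustrate how the index and tag pin ranges relate to the cache size, the sketch below splits a physical address for a given Bcache size. It assumes a direct-mapped Bcache in which the index uses address bits <log2(size)-1:4> and the tag always carries bits <42:20>; neither the mapping policy nor that split is stated in this section, and the function name is hypothetical.

```python
def bcache_index_and_tag(phys_addr, cache_size_mb):
    """Split a physical address into (index, tag) for a given Bcache size.

    ASSUMPTIONS: direct-mapped cache; index = address bits below the cache
    size with the low 4 bits dropped (16-byte data bus granularity);
    tag = address bits <42:20>, matching the BcTag<42:20> pins.
    """
    if cache_size_mb not in (1, 2, 4, 8, 16):
        raise ValueError("supported sizes are 1, 2, 4, 8 and 16 MB")
    size_bytes = cache_size_mb << 20
    index = (phys_addr & (size_bytes - 1)) >> 4
    tag = (phys_addr >> 20) & ((1 << 23) - 1)
    return index, tag


index, tag = bcache_index_and_tag(0x12345678, 1)
print(hex(index), hex(tag))
```

With a 16 MB cache the index reaches BcAddress<23>, so index bits <23:20> overlap the low tag bits under these assumptions.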


3.4.2.2 BcClkOut 


BcClkOut<3:0> are differential copies of the Bcache clock. BcClkOut may be configured such that its
rising edge lags BcAddress by 0 to 2 CPU clock cycles. BcClkOut is free-running and is derived
from the internal GCLK. Its period is a multiple of the GCLK period and is fixed for all operations.


EV6 supports only synchronous SRAMs. These synchronous SRAMs can be from 3 different families.
The first is a BurstRam with conventional Reg/Reg output that provides one piece of data for every rise of
the clock. The second type is a non-burst Reg/Reg Late Write architecture. The third type is
a BurstRam Reg/Reg Late Write with clock forwarded output, with data on the rise and fall of the clock.


3.4.2.3 BcBurst 


BcBurst is asserted on the first cycle of a read or write when BC_BURST is set in the **IPR. Tag stores 
are not bursted and can be accessed under a burst of the data BurstRam. 


3.4.2.4 BcDataClkIn & BcTagClkIn 


The BcDataClkIn and BcTagClkIn pins are to be used with high speed DDRs that provide a clock out
with the data output pins to optimize Bcache read bandwidth. EV6 will internally synchronize the data to
its CPU clock with clock forward receive circuitry similar to the System interface. For non-DDR devices,
systems will connect the .


3.4.3 Bcache Banking 


Bcache banking is possible by the decode of the most significant address bit. Switching between cache 
banks may require Rd-Rd bubbles as well as the usual Rd-Wr bubbles; this will be programmed via the 
BC_RR_BUB field of the **IPR. 


3.4.4 Bcache Transactions 
The Bcache supports 4 transactions: 


Data Read 
Data Write 
Tag Read 
Tag Write 


Data reads are always accompanied by tag reads in the first cycle of the data read. Similarly, data writes
include a tag write in the first cycle so the Bcache tag state can reflect any changes made to the block
while it was in the Dcache. Tag reads and writes are used individually as the result of System PROBE
commands. Tag reads will also be performed during a 4 cycle burst of the Data SRAMs. This allows
system probes access to the Bcache Tag store without interrupting that private access by the processor.


EV6 supports late write SRAMs - write data can be delayed from the address by 0 to 4 Bcache clocks. 


3.4.5 Bcache Clocking 


BcClkOut is used to synchronously clock address, control and data into and out of the Bcache SRAMs.
BcClkOut and BcClkIn are free running clocks and they are derived from the internal processor clock. The
period and position of the edges of the clock are determined by setting appropriate fields in the CBOX
IPRs.


The period of the Bcache clock is established by setting the bcclk_ratio CBOX IPR. The ratio is a multiple
of the processor clock and can range from 1.5 to 4.0 in 0.5 increments. This setting essentially defines the
period of the data bit. In single data mode, the clock ratio establishes the period of the Bcache clock, and
in dual data mode the bcclk_ratio defines one phase of the Bcache clock. Dual data SRAMs provide data on
the rise and fall of a clock that accompanies the data back to EV6. Systems must enable dual data mode
by asserting the bc_ddr_enable in the CBOX IPR.
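
The relationship between bcclk_ratio, the data bit period and the Bcache clock period can be sketched as follows. The function name and the `dual_data` flag are illustrative stand-ins for the bcclk_ratio and bc_ddr_enable settings. Reading "one phase" as half of the Bcache clock period is an interpretation of the text above.

```python
VALID_BCCLK_RATIOS = (1.5, 2.0, 2.5, 3.0, 3.5, 4.0)


def bcache_timing(bcclk_ratio, cpu_period_ns=2.0, dual_data=False):
    """Return (data bit period, Bcache clock period) in nsec."""
    if bcclk_ratio not in VALID_BCCLK_RATIOS:
        raise ValueError("bcclk_ratio must be 1.5 to 4.0 in 0.5 steps")
    # The ratio always sets the period of one data bit.
    bit_period = bcclk_ratio * cpu_period_ns
    # ASSUMPTION: in dual data mode one bit occupies one clock phase,
    # so the full clock period is two bit periods.
    clock_period = 2 * bit_period if dual_data else bit_period
    return bit_period, clock_period


print(bcache_timing(2.5))
print(bcache_timing(2.5, dual_data=True))
```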


The position of the clock relative to address and write data can be controlled down to a processor clock
phase by setting the appropriate value in the following 3 registers:

1. BC_LATE_WRITE_NUM delays write data relative to the address by one Bcache clock period for
   each binary value in the register from 0 to 7 Bcache clocks. This is useful for late write SRAMs.


2. BC_CPU_LATE_WRITE_NUM delays write data relative to address by one CPU clock period for
   each binary value set in the register from 0 to 3 CPU cycles. The rising clock edge moves with write
   data.


3. BC_PHASE_LATE_WRITE_NUM delays the rising edge of the Bcache clock by 0-2 CPU clock
   phases. For a 2 nsec CPU clock, this allows for 1 nsec granularity in setting the position of the
   clock.


3.4.5.1 Dual Data SRAM Read


[Timing diagram: Addr@SRAM, BcClk@SRAM, Data@SRAM (D0 to D3), DataBus@EV6 (D0 to D3) and
EchoClk@EV6, with time markers at 10 ns and 25 ns.]


The timing diagram above shows a read from a bursting dual data SRAM that forwards data on the rising
and falling edges of an echo clock. The echo clock is a reflected copy of the SRAM input clock provided 
by EV6. It is properly aligned with the data output so that it meets setup and hold requirements of the 
receive circuitry in EV6. The SRAM does a burst of 4 in interleaved burst order. Control signals not 
shown initiate the burst, control read and write and the direction of the output drivers. 


3.4.5.2 Standard SRAM Read


[Timing diagram: Addr@SRAM (read addresses A0 to A3, followed by write addresses beginning at B0),
BcClk@SRAM, RD/WR#, and Data@SRAM including the write data to the SRAM.]


The timing diagram above illustrates a late write SRAM that is non-bursting in single data mode. The
RAM provides one piece of data for every rise of the BcClk. For reads, the address is clocked into the part
in cycle ‘n’ and the data appears at the pins on clock ‘n + 1’. Each piece of data requires a new address.
The diagram shows the transition to a write, beginning with address B0 and the fall of the RD/WR#
control signal. Data for address B0 is clocked into the part on the next rising edge of the BcClk.
Note that there are two bubbles in the address when transitioning from a read to a write. There are no
bubbles when going from a write to a read.


3.5 Interrupts


The System may request interrupts via the irq_h<5:0> pins. These six interrupt sources are identical: they 
may be asynchronous, are level sensitive and can be individually masked via the EIE field of the IER IPR. 
The way these signals are used and their relative priority is completely general, and left to the System 
designer. 


3.6 Pin List


Name                      Type   Count
SysAddIn<14:0>_L
SysAddInClk_L
SysFillValid_L
SysAddOut<14:0>_L
SysAddOutClk_L
SysData<63:0>_L
SysCheck<7:0>_L
SysDataInClk<7:0>_L
SysDataOutClk<7:0>_L
SysDataInValid_L
SysDataOutValid_L
TOTAL                            123


BcAddress<23:4>_H
BcDataOE_L
BcLoad_L
BcDataWR_L
BcData<127:0>_H
BcCheck<15:0>_H
BcDataInClk<7:0>_H
BcDataInClk<7:0>_L
BcDataOutClk<3:0>_H
BcDataOutClk<3:0>_L
BcTag<42:20>_H
BcTagValid_H
BcTagDirty_H
BcTagShared_H
BcTagParity_H
BcTagOE_L
BcTagWR_L
BcTagInClk_H/L
BcTagOutClk_H/L
TOTAL


IRQ<5:0>_H                I
RESET_L                   I


TestModeSelect_H
TestClk_H
TestReset_L
TestDataIn_H
TestDataOut_H
TestStatus_H


ClkIn_H
ClkIn_L
FrameClk_H
PllBypass_H
ClkFwdReset_H
EV6CLK_H
EV6CLK_L
PLLVDD


DCOk_H
VRefBcache
VRefSys


TOTAL
GRAND TOTAL                      374


4. Privileged Architecture Library Code
This chapter describes the EV6 PALcode environment. 


4.1 Use of Alpha Implementation-Specific Opcodes 


The Alpha architecture reserves five opcode points for implementation-specific PALcode use. The table 
below lists these opcodes and their use in EV6. 


EV6 Mnemonic   Opcode (hex)   Function
HW_LD          1B             D-stream load instruction
HW_ST          1F             D-stream store instruction
HW_RET         1E             Return from PALcode routine
HW_MFPR        19             Reads the value of an IPR into an integer GPR
HW_MTPR        1D             Writes the value of an integer GPR into an IPR


These instructions generally produce an OPCDEC exception if executed while the processor is not in
PALmode; however, if I_CTL<HWE> is set, these instructions can also be executed in kernel mode.
Software which uses these instructions must adhere to the PALcode restrictions listed in this chapter.


4.1.1 HW_LD Instruction


PALcode uses the HW_LD instruction to access memory outside the realm of normal Alpha memory
management and to do special forms of D-stream loads. Data alignment traps are disabled for the HW_LD
instruction.


 31     26 25    21 20    16 15    13   12   11          0
  Opcode     Ra       Rb      Type     Len      Disp


Field    Value      Description
Opcode   1B (hex)   The opcode value: 1B (hex)
Ra                  Destination register number
Rb                  Base register for memory address
Type     000        Physical
                    The effective address for the HW_LD is physical.
         001        Physical/Lock
                    The effective address for the HW_LD is physical. Load lock
                    version of HW_LD.
         010        Virtual/VPTE
                    Flags a virtual PTE fetch (LD_VPTE). Used by trap logic to
                    distinguish single TB miss from double TB miss. Kernel
                    mode access checks are performed.
         100        Virtual
                    The effective address for the HW_LD is virtual.
         101        Virtual/WrChk
                    The effective address for the HW_LD is virtual. Access
                    checks for FOR, FOW, read and write protection.
         110        Virtual/Alt
                    The effective address for the HW_LD is virtual. Access
                    checks use DTB_ALT_MODE IPR.
         111        Virtual/WrChk/Alt
                    The effective address for the HW_LD is virtual. Access
                    checks for FOR, FOW, read and write protection. Access
                    checks use DTB_ALT_MODE IPR.
Len      0          Access length is longword
         1          Access length is quadword
Disp                Holds a 12-bit signed byte displacement
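
The field layout above can be exercised with a small encoder, sketched here. The function name and the symbolic type names are hypothetical; only the bit positions, the opcode value and the Type encodings come from the field description. Type 011 is not listed and is therefore not offered.

```python
HW_LD_OPCODE = 0x1B

# Type<15:13> encodings listed in the field table.
HW_LD_TYPES = {
    "physical": 0b000,
    "physical_lock": 0b001,
    "virtual_vpte": 0b010,
    "virtual": 0b100,
    "virtual_wrchk": 0b101,
    "virtual_alt": 0b110,
    "virtual_wrchk_alt": 0b111,
}


def encode_hw_ld(ra, rb, type_name, quadword, disp):
    """Assemble a 32-bit HW_LD instruction word."""
    if not (0 <= ra < 32 and 0 <= rb < 32):
        raise ValueError("register numbers must be 0..31")
    if not (-2048 <= disp <= 2047):
        raise ValueError("disp must fit in 12 signed bits")
    return ((HW_LD_OPCODE << 26)            # Opcode<31:26>
            | (ra << 21)                    # Ra<25:21>
            | (rb << 16)                    # Rb<20:16>
            | (HW_LD_TYPES[type_name] << 13)  # Type<15:13>
            | (int(bool(quadword)) << 12)   # Len<12>
            | (disp & 0xFFF))               # Disp<11:0>, two's complement


print(hex(encode_hw_ld(1, 2, "physical", True, 8)))
```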





4.1.2 HW_ST Instruction 


PALcode uses the HW_ST instruction to access memory outside the realm of normal Alpha memory
management and to do special forms of D-stream store instructions. Data alignment traps are inhibited for
HW_ST instructions.


 31     26 25    21 20    16 15    13   12   11          0
  Opcode     Ra       Rb      Type     Len      Disp


Field    Value        Description
Opcode   1F (hex)     The opcode value: 1F (hex)
Ra                    Write data register number
Rb                    Base register for memory address
Type     000          Physical
                      The effective address for the HW_ST is physical.
         001          Physical/Cond
                      The effective address for the HW_ST is physical. Store
                      conditional version of HW_ST. The lock flag is returned in
                      Ra. Refer to PAL restrictions for correct use of this
                      function.
         010          Virtual
                      The effective address for the HW_ST is virtual.
         110          Virtual/Alt
                      The effective address for the HW_ST is virtual. Access
                      checks use DTB_ALT_MODE IPR.
         all others   Unused
Len      0            Access length is longword
         1            Access length is quadword
Disp                  Holds a 12-bit signed byte displacement


4.1.3 HW_RET Instruction


The HW_RET instruction is used to return instruction flow to a specified PC. The Rb field of the
HW_RET instruction specifies an integer GPR which holds the new value of the PC. Bit <0> of this
register provides the new value of PALmode after the HW_RET instruction is executed. Bits <15:14> of
the instruction contain the stack action. Normally the HW_RET follows a CALL_PAL instruction or
trap handler, which pushed its PC onto the prediction stack. In this mode, the HINT should be set to “10”
to pop the PC and generate a predicted target address for the HW_RET. In certain circumstances, the
HW_RET is used in the middle of a PAL flow to cause a group of instructions to retire. In these cases, if
the HW_RET does not have a corresponding instruction which pushed a PC onto the stack, the HINT field
should be set to ‘00’ to keep the stack from being modified. In the rare circumstance that the HW_RET
might be used like a JSR or JSR_COROUTINE, the stack can be controlled by setting the HINT bits
accordingly.


 31     26 25    21 20    16 15   14   13    12          0
  Opcode     Ra       Rb      HINT   STALL


Field    Value      Description
Opcode   1E (hex)   The opcode value: 1E (hex)
Ra                  Register number. Should be R31.
Rb                  Target PC of HW_RET. Bit <0> of the register’s contents determines
                    the new value of PALmode.
Hint     00         HW_JMP: PC is not pushed onto prediction stack; no predicted target
         01         HW_JSR: PC is pushed onto prediction stack; no predicted target
         10         HW_RET: prediction is popped off stack and used as target
         11         HW_COROUTINE: prediction is popped and used as target. PC is
                    pushed onto stack
Stall               If set, the fetcher is stalled until the HW_RET is retired or aborted.
                    EV6 will force a mispredict, kill instructions which were fetched
                    beyond the HW_RET, refetch the target of the HW_RET and stall
                    until the HW_RET is retired or aborted. Note that if instructions
                    beyond the HW_RET have issued out-of-order they will be killed and
                    refetched.
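
A comparable encoder for HW_RET is sketched below. The function name is hypothetical. Placing STALL at bit <13> is read from the bit-position row of the format diagram, and bits <12:0> are left as zero because the text does not describe them; both should be treated as assumptions.

```python
HW_RET_OPCODE = 0x1E

# Hint<15:14> encodings from the field table.
HW_RET_HINTS = {
    "HW_JMP": 0b00,
    "HW_JSR": 0b01,
    "HW_RET": 0b10,
    "HW_COROUTINE": 0b11,
}


def encode_hw_ret(rb, hint="HW_RET", stall=False, ra=31):
    """Assemble a 32-bit HW_RET instruction word (Ra defaults to R31)."""
    if not (0 <= ra < 32 and 0 <= rb < 32):
        raise ValueError("register numbers must be 0..31")
    return ((HW_RET_OPCODE << 26)          # Opcode<31:26>
            | (ra << 21)                   # Ra<25:21>, should be R31
            | (rb << 16)                   # Rb<20:16>, holds the target PC
            | (HW_RET_HINTS[hint] << 14)   # Hint<15:14>, stack action
            | (int(bool(stall)) << 13))    # ASSUMPTION: Stall is bit <13>


# Return through R23, pop the prediction stack, stall the fetcher.
print(hex(encode_hw_ret(23, "HW_RET", stall=True)))
```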


4.1.4 HW_MFPR and HW_MTPR Instructions 


The HW_MFPR and HW_MTPR instructions are used to access internal processor registers. The
HW_MFPR instruction reads the value from the specified IPR into the integer register specified by the Ra
field of the instruction. The HW_MTPR instruction writes the value from the integer GPR specified by the
Rb field of the instruction into the specified IPR.


Field       Value      Description
Opcode      19 (hex)   The opcode value for HW_MFPR: 19 (hex)
            1D (hex)   The opcode value for HW_MTPR: 1D (hex)
Ra                     Destination register for HW_MFPR. Should be R31 for HW_MTPR.
Rb                     Source register for HW_MTPR. Should be R31 for HW_MFPR.
INDEX                  IPR index.
SCBD MASK              Specifies which IPR scoreboard bits in the IQ are to be applied to this
                       instruction. A set mask bit indicates that the corresponding IPR
                       scoreboard bit should be applied to this instruction.


4.2 Internal Processor Register Access Mechanisms


Since the EV6 Ibox reorders instructions and executes instructions speculatively, extra hardware is 
required to provide software with the correct view of architecturally defined state. The Alpha architecture 
defines two classes of state - general-purpose registers and memory. Register renaming is used to provide 
architecturally correct register file behavior, while the Ibox and Mbox each have hardware dedicated to 
invisibly providing correct memory behavior to the programmer. Since the internal processor registers are 
implementation-specific state not defined by the Alpha architecture, access mechanisms for these registers 
may be defined which impose restrictions and limitations on the software which uses them. This section 
describes the hardware and software access mechanisms which are used for EV6’s IPRs. 


With respect to a particular IPR, each instruction type can be classified by how it affects and is affected by
the value held by that IPR. Explicit readers are HW_MFPR instructions which explicitly read the value
of the IPR. Implicit readers are instructions whose behavior is affected by the value of the IPR. For
example, each load instruction is an implicit reader of the DTB. Explicit writers are HW_MTPR
instructions which explicitly write a value into the IPR. Implicit writers are instructions which may write
a value into the IPR as a side effect of execution. For example, a load instruction which generates an
access violation is an implicit writer of the VA, MM_STAT, and EXC_ADDR IPRs. In EV6, only
instructions which generate an exception will act as implicit IPR writers. Only certain IPRs, such as
write-one-to-clear bits, are both implicitly and explicitly written. The read-write semantics of these IPRs
are controlled by software.


4.2.1 IPR Scoreboard Bits


In previous Alpha implementations, IPR registers were not scoreboarded in hardware, and software was
required to schedule HW_MTPR and HW_MFPR instructions for each machine’s pipeline organization in
order to ensure correct behavior. This software scheduling task is more difficult in EV6 since the Ibox
performs dynamic scheduling. Hence eight extra scoreboard bits are used within the IQ to help maintain
correct IPR access order. The HW_MTPR and HW_MFPR instruction formats contain an eight-bit field
which is used as an IPR scoreboard bit mask to specify which of the eight IPR scoreboard bits are to be
applied to the instruction.


For HW_MTPR, if any of the unmasked scoreboard bits are set when the instruction is about to enter the 
IQ, then the instruction (and those behind it) is stalled outside the IQ until all the unmasked scoreboard 
bits are clear and the queue does not contain any implicit or explicit readers which were dependent on 
those bits when they entered the queue. When all the unmasked scoreboard bits are clear and the queue 
does not contain any of those readers, the instruction enters the IQ, and the unmasked scoreboard bits are 
set. 


HW_MFPR instructions are stalled in the IQ until all their unmasked IPR scoreboard bits are clear. 


Scoreboard bits <3:0> and <7:4> behave differently in regard to their effect on other instructions when 
set, and in regard to how they are cleared. 


If any of scoreboard bits <3:0> are set when a load or store instruction enters the IQ, then that load or 
store will not issue from the IQ until those scoreboard bits are clear. 


Scoreboard bits <3:0> are cleared when the HW_MTPR instructions which set them issue (or are aborted). 
Bits <7:4> are cleared when the HW_MTPR instructions which set them retire (or are aborted). 


Bits <3:0> are used for the DTB_TAG and DTB_PTE register pairs within the DTB fill flows. These bits
can be used to order writes to the DTB with respect to loads and stores. See sections 4.6.1 and 5.3.1. The
assignment of IPRs to scoreboard bits is given in the next chapter.


Bit <0> is used in both DTB and ITB fill flows to trigger, in hardware, a light-weight memory barrier
(TB-MB) to be inserted between a ld_vpte and the corresponding virtual-mode load which TB-missed.
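
The scoreboard rules in this section can be summarized with a toy software model, sketched below. It is only an aid to reading the rules, not a description of the hardware; the class and method names are hypothetical. It omits the additional requirement that the queue hold no dependent readers before a HW_MTPR may enter.

```python
class IprScoreboard:
    """Toy model of the eight IPR scoreboard bits."""

    ISSUE_CLEARED = 0x0F   # bits <3:0>: cleared when the HW_MTPR issues
    RETIRE_CLEARED = 0xF0  # bits <7:4>: cleared when the HW_MTPR retires

    def __init__(self):
        self.bits = 0

    def mtpr_try_enter(self, mask):
        """HW_MTPR enters the IQ only if all of its selected bits are clear."""
        if self.bits & mask:
            return False  # stalled outside the IQ
        self.bits |= mask
        return True

    def mtpr_issue(self, mask):
        self.bits &= ~(mask & self.ISSUE_CLEARED)

    def mtpr_retire(self, mask):
        self.bits &= ~(mask & self.RETIRE_CLEARED)

    def mtpr_abort(self, mask):
        self.bits &= ~mask  # an abort releases every bit the writer set

    def mfpr_can_issue(self, mask):
        return (self.bits & mask) == 0

    def load_store_enter(self):
        """Record which of bits <3:0> were set when the load/store entered."""
        return self.bits & self.ISSUE_CLEARED

    def load_store_can_issue(self, snapshot):
        return (self.bits & snapshot) == 0


sb = IprScoreboard()
sb.mtpr_try_enter(0x11)        # writer uses bit 0 and bit 4
snap = sb.load_store_enter()   # a load enters behind the writer
sb.mtpr_issue(0x11)            # bit 0 clears at issue, bit 4 remains
print(sb.load_store_can_issue(snap), sb.mfpr_can_issue(0x10))
```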


4.2.2 Hardware Structure of Explicitly Written IPRs 


IPRs which are written by software are physically implemented as two registers. When the HW_MTPR
instruction which writes the IPR executes, it writes its value to the first register. When the HW_MTPR
instruction retires, the contents of the first register are written into the second register. Instructions which
either implicitly or explicitly read the value of the IPR do so from the second register. Read-after-write 
and write-after-write dependencies are managed using the IPR scoreboard bits. Write-after-read conflicts 
are avoided: the second register is not written until the writer retires, the writer won’t retire before the 
previous reader retires, and the reader retires after it has read its value from the second register. 


Some groups of IPRs are built using a single shared “first” register. To prevent write-after-write conflicts, 
IPRs which share a “first” register also share scoreboard bits. 


4.2.3 Hardware Structure of Implicitly Written IPRs 


Implicitly written IPRs are physically built using only a single level of register; however, the IPR has two
hardware states associated with it:

1. Default State: The contents of the register may be written when an instruction generates an exception.
   If an exception occurs, write a new value into the IPR and go to state 2.

2. Locked State: The contents of the register may only be overwritten by an excepting instruction which
   is older than the instruction associated with the contents of the IPR. If such an exception occurs,
   overwrite the value of the IPR. When the triggering instruction, or an instruction which is older than the
   triggering instruction, is killed by the Ibox, go to state 1.
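
The two states can be expressed as a small state machine, sketched below. The class name, the method names and the use of an integer "age" (smaller means older in fetch order) are all assumptions made for illustration; the hardware tracks instruction age by its own means.

```python
class ImplicitIpr:
    """Toy model of an implicitly written IPR and its two states."""

    def __init__(self):
        self.value = None
        self.owner_age = None  # None = Default State, otherwise Locked State

    def exception(self, age, value):
        """An excepting instruction tries to write the IPR.

        Returns True if the write took effect.
        """
        if self.owner_age is None or age < self.owner_age:
            # Default State, or Locked State with an older excepting instruction.
            self.value = value
            self.owner_age = age
            return True
        return False  # Locked State, younger instruction: write ignored

    def kill(self, age):
        """The Ibox kills the instruction with the given age."""
        if self.owner_age is not None and age <= self.owner_age:
            # The triggering instruction, or one older than it, was killed.
            self.owner_age = None


ipr = ImplicitIpr()
ipr.exception(10, "A")          # Default -> Locked, value "A"
ipr.exception(12, "B")          # younger than 10: ignored
ipr.exception(7, "C")           # older than 10: overwrites
print(ipr.value)
```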


4.2.4 IPR Access Ordering


IPR access mechanisms must allow values to be passed through each IPR from a producer to its intended
consumers. The table below exhaustively lists all the pair-wise instruction fetch orderings between
instructions of the four IPR access types, specifies whether access order must be maintained, and if so, the
mechanisms used to ensure correct ordering.


Second instruction: Implicit Reader
   First = Implicit Reader:   Reads can be reordered.
   First = Implicit Writer:   No IPRs in this class.
   First = Explicit Reader:   Reads can be reordered.
   First = Explicit Writer:   Scoreboard bits stall issue of reader until writer retires, or
                              HW_RET/STALL is used to stall reader.

Second instruction: Implicit Writer
   First = Implicit Reader:   No IPRs in this class.
   First = Implicit Writer:   The hardware structure of implicitly written IPRs handles this case.
   First = Explicit Reader:   IPR-specific PALcode restrictions are required for this case. For
                              example, reads might be required to be placed in certain locations in
                              a PAL flow.
   First = Explicit Writer:   No IPRs in this class.

Second instruction: Explicit Reader
   First = Implicit Reader:   Reads can be reordered.
   First = Implicit Writer:   If the reader is in the PALcode routine invoked by the exception
                              associated with the writer, then
   First = Explicit Reader:   Reads can be reordered.
   First = Explicit Writer:   Scoreboard bits stall issue of reader until writer is retired.

Second instruction: Explicit Writer
   First = Implicit Reader:   Reader reads second latch. Writer can’t write second latch until
                              it retires.
   First = Implicit Writer:   Write-one-to-clear bits, or performance counter special case. For
                              example, performance counter increments are typically not
                              scoreboarded against reads.
   First = Explicit Reader:   Reader reads second latch. Writer can’t write second latch until
                              it retires.
   First = Explicit Writer:   Scoreboard bits stall second writer in map stage until first writer
                              retires.


4.2.5 IPRs and HW_RET Stalls 


In some cases, correct ordering of an explicit write to an IPR followed by an implicit read of the IPR is
guaranteed using the IPR scoreboard bits. However, if the instruction which implicitly reads the IPR does
so before the issue stage of the pipeline then this method does not work. For example, modification of the
ITB affects instructions before the issue stage of the pipeline. For this case PALcode must contain a
HW_RET instruction with its stall bit set before any instruction which implicitly reads the IPR(s) in
question. This prevents instructions which are newer than the HW_RET from being successfully fetched,
issued, and retired until after the HW_RET instruction is retired (or aborted).


4.3 PAL Shadow Registers 


EV6 contains extra virtual integer registers which are available to PALcode for use as scratch space and 
storage for commonly used values. These registers are made available under the control of the SDE<1:0>
field of the I_CTL IPR. 


Any PALcode which supports CALL_PAL instructions must leave one of SDE<1:0> set when the
processor is in native mode, since hardware writes a shadow PAL register with the return address of
CALL_PAL instructions. See section 4.5.1. 


4.4 PALcode Emulation of FPCR 


The FPCR register contains two classes of bits, status and control, which are accessed via the MT_FPCR
and MF_FPCR instructions. The register is physically implemented like an explicitly written IPR. It may
be written with a value from the floating point register file via the MT_FPCR instruction. Architecturally
compliant FPCR behavior requires PALcode assistance. There are three behaviors of the FPCR register
which must be considered:

1. Correct operation of the status bits, which must be set when a floating point instruction encounters an 
exceptional condition, independent of whether a trap for the condition is enabled. 

2. Correct values when read via the MF_FPCR instruction. 

3. Correct behavior when written via the MT_FPCR instruction. 


4.4.1 Status Flags 


The FPCR status bits in EV6 are set with PALcode assistance. Floating point exceptions for which the
associated FPCR status bit is clear, or for which the associated trap is enabled, result in a hardware trap to
the ARITH PALcode routine. The EXC_SUM register contains information to allow this routine to update
the FPCR appropriately, and to decide whether to report the exception to the operating system.
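
The trap condition described above can be written as a simple predicate. The sketch below is illustrative only: the function names are hypothetical, the FPCR status bits are modelled as a Python set of condition names, and the rule that the exception is reported to the operating system exactly when its trap is enabled is an assumption, not something this section states.

```python
def traps_to_arith(status_bit_set: bool, trap_enabled: bool) -> bool:
    """True if a floating point exception results in a hardware trap to ARITH.

    Hardware traps when the associated FPCR status bit is clear, or when
    the associated trap is enabled.
    """
    return (not status_bit_set) or trap_enabled


def arith_palcode(fpcr_status: set, condition: str, trap_enabled: bool):
    """Hypothetical ARITH handler step: returns (new_status, report_to_os)."""
    new_status = set(fpcr_status) | {condition}   # PALcode sets the status bit
    report_to_os = trap_enabled                   # assumed reporting rule
    return new_status, report_to_os
```

Once PALcode has set the status bit, a repeat of the same condition with its trap disabled no longer reaches the ARITH routine.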


4.4.2 MF_FPCR 


The MF_FPCR is issued from the floating point queue and executed by the Fbox. No PALcode assistance 
is required. 


4.4.3 MT_FPCR


The MT_FPCR instruction is issued from the floating point queue. This instruction is implemented as an
explicit IPR write: the value is written into the “first” latch, and when the instruction retires the value is
written into the “second” latch. There is no IPR scoreboarding mechanism in the floating point queue,
however, so PALcode assistance is required to ensure that subsequent readers of the FPCR get the updated
value.


Subsequent to writing the “first latch,” the MT_FPCR instruction invokes a synchronous trap to the 
MT_FPCR PALcode entry point. The PALcode can simply return using a HW_RET instruction with its 
STALL bit set. This sequence ensures that the MT_FPCR instruction will be correctly ordered with 
respect to subsequent readers of the FPCR. 


4.5 PALcode Entry Points 
PALcode is invoked at specific entry points, of which there are two classes: CALL_PAL and exceptions. 


4.5.1 CALL_PAL entry


CALL_PAL entry points are used whenever the Ibox encounters a CALL_PAL instruction in the 
instruction stream. In order to speed the processing of CALL_PAL instructions, they do not invoke 
pipeline aborts, but are processed as normal jumps to the offset from the contents of the PAL_BASE 
register which is specified by the CALL_PAL’s function field. The IBOX fetches a CALL_PAL 
instruction, bubbles one cycle, and then fetches the instructions at the CALL_PAL entry point. For 
convenience of implementation, returns from CALL_PAL are aided by a linkage register (much like 
JSR’s). A PAL shadow register is used as the linkage register - the Ibox loads it with the PC of the 
instruction after the CALL_PAL instruction. Bit <0> of the linkage register is set if the CALL_PAL was 
executed while the processor was in PAL mode. If I_CTL<NT_MODE> is clear then PAL shadow R27 is
the linkage register, otherwise PAL shadow R23 is used. The Ibox also pushes the value of the return PC 
onto the return prediction stack. CALL_PAL instructions start at the following offsets: 


• Privileged CALL_PAL instructions start at offset 2000₁₆
• Nonprivileged CALL_PAL instructions start at offset 3000₁₆.


Each CALL_PAL instruction includes a function field which is used to calculate the PC of its
associated PALcode entry point. The PALcode OPCDEC flow will be invoked if the CALL_PAL function
field is:


• in the range of 40₁₆ to 7F₁₆ inclusive, or
• greater than BF₁₆, or
• between 00₁₆ and 3F₁₆, inclusive, and PS<CUR_MODE> is not equal to kernel


If none of the above conditions are met, then the PALcode entry point PC is as follows: 


PC<63:15> = PAL_BASE<63:15>
PC<14> = 0
PC<13> = 1
PC<12> = CALL_PAL function field <7>
PC<11:6> = CALL_PAL function field <5:0>
PC<5:1> = 0
PC<0> = 1 (PALmode)
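
The entry point calculation can be checked with a short software model. This is a sketch of the rules above rather than a hardware description; the function name is hypothetical, and the value 0 for kernel mode is taken from the CM<1:0> encoding in chapter 5.

```python
KERNEL = 0  # current-mode encoding for kernel (00 in the CM<1:0> field)


def call_pal_entry(pal_base: int, func: int, cur_mode: int = KERNEL):
    """Return the CALL_PAL entry PC, or None if the OPCDEC flow is invoked."""
    if 0x40 <= func <= 0x7F or func > 0xBF:
        return None                          # unimplemented function ranges
    if func <= 0x3F and cur_mode != KERNEL:
        return None                          # privileged call outside kernel mode
    pc = pal_base & 0xFFFF_FFFF_FFFF_8000    # PC<63:15> = PAL_BASE<63:15>, PC<14> = 0
    pc |= 1 << 13                            # PC<13> = 1
    pc |= ((func >> 7) & 1) << 12            # PC<12> = function field <7>
    pc |= (func & 0x3F) << 6                 # PC<11:6> = function field <5:0>
    pc |= 1                                  # PC<0> = 1 (PALmode)
    return pc
```

With PAL_BASE equal to zero, function 00 yields 2001₁₆ and function 80₁₆ yields 3001₁₆, which agrees with the 2000₁₆ and 3000₁₆ offsets given earlier once the PALmode bit is ignored.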


4.5.2 PALcode Exception Entry Points 


When hardware encounters an exception the Ibox jumps to a PALcode entry point at a PC determined by
the type of exception, and writes the PC of the instruction which triggered the exception into the
EXC_ADDR register and onto the top of the return prediction stack.


The table below shows the PALcode exception entry points and their offset from the PAL_BASE IPR. The
entry points are listed in decreasing order of priority.





Entry Name       Type          Offset (hex)   Description

DTBM_DOUBLE_3    Fault         100            D-stream TB miss on virtual page table entry
                                              fetch. Use three-level flow.
DTBM_DOUBLE_4    Fault         180            D-stream TB miss on virtual page table entry
                                              fetch. Use four-level flow.
FEN              Fault         200            Floating point disabled
UNALIGN          Fault         280            D-stream unaligned reference
DTBM_SINGLE      Fault         300            D-stream TB miss
DFAULT           Fault         380            D-stream fault or virtual address sign check
                                              error
OPCDEC           Fault         400            Illegal opcode or function field:
                                              • opcode 1, 2, 3, 4, 5, 6 or 7
                                              • opcode 19₁₆, 1B₁₆, 1D₁₆, 1E₁₆ or 1F₁₆,
                                                not PAL mode or not I_CTL<HWE>
                                              • extended precision IEEE format
                                              • unimplemented function field of opcodes
                                                14₁₆ or 1C₁₆
IACV             Fault         480            I-stream access violation or virtual address
                                              sign check error
MCHK             Interrupt     500            Machine Check
ITB_MISS         Fault         580            I-stream TB miss
ARITH            Synch. Trap   600            Arithmetic exception or update to FPCR
INTERRUPT        Interrupt     680            Interrupts: hardware, software and AST
MT_FPCR          Synch. Trap   700            Invoked when a MT_FPCR instruction is
                                              issued.
RESET/WAKEUP     Interrupt     780            Chip reset or wakeup from sleep mode
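
As an illustration of how the table is used, the sketch below computes an entry address as PAL_BASE plus the listed offset and picks the highest-priority entry among several pending exceptions. The dictionary and function names are hypothetical, the offsets are read as hexadecimal, and the low-order PAL mode bit of the PC is deliberately left out because this section does not describe it for exception entry.

```python
# Offsets from PAL_BASE, listed in decreasing order of priority.
EXCEPTION_OFFSETS = {
    "DTBM_DOUBLE_3": 0x100,
    "DTBM_DOUBLE_4": 0x180,
    "FEN": 0x200,
    "UNALIGN": 0x280,
    "DTBM_SINGLE": 0x300,
    "DFAULT": 0x380,
    "OPCDEC": 0x400,
    "IACV": 0x480,
    "MCHK": 0x500,
    "ITB_MISS": 0x580,
    "ARITH": 0x600,
    "INTERRUPT": 0x680,
    "MT_FPCR": 0x700,
    "RESET/WAKEUP": 0x780,
}


def exception_entry(pal_base: int, name: str) -> int:
    """Address of the PALcode entry point for the named exception."""
    return pal_base + EXCEPTION_OFFSETS[name]


def highest_priority(pending):
    """Return the pending entry name that appears first in the table, or None."""
    for name in EXCEPTION_OFFSETS:      # dicts preserve insertion order
        if name in pending:
            return name
    return None
```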





4.6 TB Fill Flows 


This section shows the expected PALcode flows for DTB miss and ITB miss. Familiarity with EV6’s IPRs 
is assumed. See chapter 5. 


4.6.1 DTB Fill 


The single-miss DTB flow is shown below: 


Instruction                    Ebox subcluster    Issue Cycle

mf      r27,exc_addr           0L                 1
mf      r8,va_form,+<7:4>      1L                 1
mf      r9,mm_stat             0L                 2
mf      r11,exc_sum            0L                 3
ld_vpte r8,(r8)                1L                 2
srl     r25,#PHYS,r10          xU                 2
blbs    r10,1_to_1_map         xU                 3
mf      r10,va,+<7:4>          1L                 3
blbc    r8,invalid_pte         xU                 5
mt      r10,dtb_tag0           0L                 4
mt      r10,dtb_tag1           1L                 4
mt      r8,dtb_pte0            0L                 5
mt      r8,dtb_pte1            1L                 5
hw_ret  (r27)                  0L                 6
LD or ST (restart)                                7 (14 with TB-MB)


Here are some notes with respect to this flow: 

• r8, r9, r10, & r27 are PAL shadows.

• The arcs show issue order dependencies that are not related to register data.

• IPR scoreboard bits <3:0> are used to order the restarted load or store with respect to the DTB writes.

• MM_STAT and VA will not be overwritten if the LD_VPTE instruction misses the DTB - there is no
  issue order constraint here.

• The code is written to prevent a later execution of the DTB fill from issuing ahead of a previous
  execution and corrupting the previous write to the TB registers. This is accomplished by placing code
  dependencies on scoreboard bits <7:4> in the path of the successive writers. This keeps the
  successive writers from issuing ahead of the retiring of the previous writers.

• The issue of MTPR DTB_PTE0 triggers, in hardware, a light-weight memory barrier (TB-MB) which
  enforces read-ordering of stores from another processor (I) to this processor’s (J) page table and this
  processor’s virtual memory area such that if this processor sees the write to the PTE from (I) it will
  see the new data:

      Processor I     Processor J
      Wr Data         LD/ST
      MB              <tb miss>
      Wr PTE          LD-PTE, write TB
                      LD/ST

• The conditional branch is placed in the code so that all of the MTPR’s issue and retire or none of
  them issue and retire. This allows the TB fill hardware to update the TB whenever it sees the retiring
  of PTE1 and to ignore writes to TAG0/TAG1/PTE0/PTE1 in the interim between the issuing of those
  writes and a retire of PTE1.


4.6.2 ITB Fill 
The ITB miss flow is shown below: 


Instruction                    Ebox subcluster    Issue Cycle

mf      r8,iva_form            0L                 1
mf      r27,exc_addr           0L                 x
ld_vpte r8,(r8)                xL                 4     ; get PTE
lda     r9,0x0fff              xU                 2     ; create mask for prot
and     r8,r9,r9               x                  7     ; get prot bits
srl     r25,#PHYS,r10          xU                 x
blbs    r10,1_to_1             xU                 x
srl     r8,#19,r10             xU                 7
sll     r10,#PTE_PFN,r10       xU                 8     ; put PFN in place
and     r8,#foe_bit,r11        xL                 8     ; get FOE bit
blbc    r8,invalid_pte         xU                 x
bne     r11,foe_pte            xU                 x
bis     r9,r10,r10             xL                 9     ; PTE in ITB format
mt      r27,itb_tag            0L                 6
mt      r10,itb_pte            0L                 10
hw_ret/stall (r27)             0L                 5     ; hw_ret/stall
(istream restart)


Here are some notes with respect to this flow: 


• The ITB is only accessed on Icache misses.

• r8, r9, r10, r11 & r27 are PAL shadows.

• The arcs show issue order dependencies that are not related to register data.

• The HW_RET instruction should have its STALL bit set to ensure that the restarted I-stream does not
  read the ITB until the ITB is written.


5. Internal Processor Registers
This chapter describes EV6’s internal processor registers (IPRs). 


IPR Mnemonic     Index        Scoreboard    Access    MT/MF Issued      Latency
                              Bit           Type      from Ebox Pipe    for MFPR

Ebox IPRs
CC               1100 0000    5             RW        1L                1
CC_CTL           1100 0001    5             W         1L
VA               1100 0010    4,5,6,&7      R         1L                1
VA_FORM          1100 0011    4,5,6,&7      R         1L                1
VA_CTL           1100 0100    5             W         1L

Ibox IPRs
ITB_TAG          0000 0000    6             W         0L
ITB_PTE          0000 0001    4&0           W         0L
ITB_IAP          0000 0010    4             W         0L
ITB_IA           0000 0011    4             W         0L
ITB_IS           0000 0100    4&6           W         0L
EXC_ADDR         0000 0110                  R         0L                3
IVA_FORM         0000 0111                  R         0L                3
CM               0000 10x1    4             RW        0L                3
IER              0000 101x    4             RW        0L                3
SIRR             0000 1100    4             RW        0L                3
ISUM             0000 1101                  R         0L                3
HW_INT_CLR       0000 1110    4             W         0L
EXC_SUM          0000 1111                  R         0L                3
PAL_BASE         0001 0000    4             RW        0L                3
I_CTL            0001 0001    4             RW        0L                3
IC_FLUSH         0001 0011    4             W         0L
PCTR_CTL         0001 0100    4             RW        0L                3
CLR_MAP          0001 0101    4,5,6,&7      W         0L
SLEEP            0001 0111    4,5,6,&7      W         0L
I_STAT           0001 0110                  RW        0L                3
ASN              01xx xxx1    4             RW        0L                3
ASTER            01xx xx1x    4             RW        0L                3
ASTRR            01xx x1xx    4             RW        0L                3
PPCE             01xx 1xxx    4             RW        0L                3
FPE              01x1 xxxx    4             RW        0L                3

Mbox IPRs
DTB_TAG0         0010 0000    2&6           W         0L
DTB_TAG1         1010 0000    1&5           W         1L
DTB_PTE0         0010 0001    0&4           W         0L
DTB_PTE1         1010 0001    3&7           W         1L
DTB_IAP          1010 0010    7             W         1L
DTB_IA           1010 0011    7             W         1L
DTB_IS0          0010 0100    6             W         0L
DTB_IS1          1010 0100    7             W         1L
DTB_ASN0         0010 0101    4             W         0L
DTB_ASN1         1010 0101    7             W         1L
DTB_ALT_MODE     0010 0110    6             W         0L
MM_STAT          0010 0111                  R         0L                3
M_CTL            0010 1000    6             W         0L
DC_CTL           0010 1001    6             W         0L
DC_STAT          0010 1010    6             RW        0L                3

Cbox IPRs
DATA             0010 1011    6             RW        0L                3
SHIFT_CONTROL    0010 1100    6             W         0L

TBox IPRs
SL_XMIT                                     W
SL_RCV                                      R                           3


5.1 Ebox IPRs


5.1.1 CC 


The cycle counter register (CC) is a read/write register. The lower half of CC is a counter which, when 
enabled via CC_CTL<32>, increments once each CPU cycle. The upper half of the register is simply 32 
bits of register storage which may be used as a counter offset as described in the Alpha SRM. A 
HW_MTPR to the CC register writes the upper half of the register and leaves the lower half unchanged. 
The RPCC instruction returns the full 64-bit value of the register. 


CC<63:32>   OFFSET
CC<31:0>    COUNTER


5.1.2 CC_CTL
The cycle counter control register (CC_CTL) is a write only register through which the lower half of the 
CC register may be written and its associated counter enabled and disabled. 


CC_CTL<31:4>   COUNTER


Name            Type   Description

Counter<31:4>   W      This is the field through which CC<31:4> may be written. Writes to
                       CC_CTL result in CC<3:0> being cleared.

CC_ENA          W      Counter enable. When set, this bit allows the cycle counter to
                       increment.
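
The interaction between CC and CC_CTL can be summarised with a small model. It is a sketch under stated assumptions: the class and method names are hypothetical, CC_ENA is taken to be CC_CTL<32> as mentioned in the CC description, and the counter is assumed to wrap at 32 bits without carrying into the offset half.

```python
MASK32 = 0xFFFF_FFFF


class CycleCounter:
    """Toy model of the CC register and its CC_CTL control register."""

    def __init__(self) -> None:
        self.offset = 0       # CC<63:32>, plain register storage
        self.counter = 0      # CC<31:0>, the cycle counter
        self.enabled = False  # CC_ENA

    def mtpr_cc(self, value: int) -> None:
        """HW_MTPR to CC: writes the upper half, leaves the lower half alone."""
        self.offset = (value >> 32) & MASK32

    def mtpr_cc_ctl(self, value: int) -> None:
        """HW_MTPR to CC_CTL: loads CC<31:4>, clears CC<3:0>, sets the enable."""
        self.counter = value & 0xFFFF_FFF0
        self.enabled = bool((value >> 32) & 1)   # assumed position of CC_ENA

    def tick(self, cycles: int = 1) -> None:
        """Advance by `cycles` CPU cycles; counts only when enabled."""
        if self.enabled:
            self.counter = (self.counter + cycles) & MASK32   # assumed wrap

    def rpcc(self) -> int:
        """RPCC returns the full 64-bit register."""
        return (self.offset << 32) | self.counter
```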


5.1.3 VA


VA is a read-only register. When a D-stream TB miss or fault occurs the associated effective virtual
address is written into the VA register. VA is not written when a LD_VPTE gets a DTB miss or D-fault.


VA<63:0>   Virtual Address


5.1.4 VA_FORM


VA_FORM is a read-only register containing the virtual page table entry address derived from the 
faulting virtual address stored in the VA register, and from the virtual page table base and associated 
control bits stored in the VA_CTL register. 





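[Figure: VA_FORM register formats, one for each supported combination of the VA_48 and VA_FORM_32
control bits. Field labels recoverable from the figure: VPTB<63:33>; VA<47:42> and SEXT(VA<47>) for
VA_48 == 1; VPTB<31:30> for VA_48 == 0, VA_FORM_32 == 1.]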


5.1.5 VA_CTL


VA_CTL is a write-only register which controls the way in which the faulting virtual address stored in the
VA register is formatted when read via the VA_FORM register. It also contains control bits which affect
the behavior of the memory pipe virtual address sign extension checkers, and the behavior of the Ebox
extract, insert and mask instructions.


[Figure: VA_CTL register layout. Fields: B_ENDIAN, VA_48, VA_FORM_32, VPTB<31:30> and VPTB<63:32>.]

Name          Type   Description

B_ENDIAN      W,0    Big Endian Mode. When set:
                     • the shift amount (Rbv<2:0>) is inverted for EXTxx, INSxx and
                       MSKxx instructions
                     • the lower bits of the physical address for D-stream accesses are
                       inverted based upon the length of the reference:
                       => Byte: invert bits <2:0>
                       => Word: invert bits <2:1>
                       => Longword: invert bit <2>

VA_48         W,0    This bit controls the format applied to effective virtual addresses by
                     the VA_FORM register and the memory pipe virtual address sign
                     extension checkers. When VA_48 is clear, 43-bit virtual address
                     format is used, and when VA_48 is set, 48-bit virtual address format
                     is used. The effect of VA_48 on the VA_FORM register is described
                     above.

                     When VA_48 is set the sign extension checkers generate an ACV if:

                         va<63:0> != SEXT(va<47:0>)

                     When VA_48 is clear the sign extension checkers generate an
                     ACV if:

                         va<63:0> != SEXT(va<42:0>)

VA_FORM_32    W,0    This bit is used to control address formatting on a read of the
                     VA_FORM register. See the section on the VA_FORM register for
                     details.

VPTB<63:30>   W      Virtual Page Table Base. See the VA_FORM register section for
                     details.
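
The sign extension check and the big-endian address inversion can both be expressed in a few lines. The sketch below is illustrative; the function names are hypothetical, and leaving quadword references unmodified in big-endian mode is an assumption, since the table lists only byte, word and longword references.

```python
MASK64 = (1 << 64) - 1


def sext(value: int, bits: int) -> int:
    """Sign-extend the low `bits` bits of `value` to a 64-bit unsigned value."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):                       # sign bit set
        value |= MASK64 & ~((1 << bits) - 1)      # fill the upper bits with ones
    return value


def va_sign_check_acv(va: int, va_48: bool) -> bool:
    """True if the sign extension checkers would generate an ACV for `va`."""
    bits = 48 if va_48 else 43
    return (va & MASK64) != sext(va, bits)


# Low address bits inverted in big-endian mode, keyed by reference size in bytes.
INVERT_MASK = {1: 0b111, 2: 0b110, 4: 0b100}


def big_endian_pa(pa: int, size_bytes: int, b_endian: bool) -> int:
    """Apply the B_ENDIAN inversion to a D-stream physical address."""
    if not b_endian:
        return pa
    return pa ^ INVERT_MASK.get(size_bytes, 0)    # other sizes: assumed unchanged
```

An address with bit <42> set and all higher bits clear fails the 43-bit check but passes the 48-bit check.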


5.2 Ibox IPRs
This section describes the IPRs which control Ibox functions. 


5.2.1 ITB_TAG 

ITB_TAG is a write-only register through which the ITB tag array is written. A write to ITB_TAG 
actually writes a register outside the ITB array. When a write to the ITB_PTE register is retired, the 
contents of both the ITB_TAG and ITB_PTE registers are written into the ITB entry. The specific ITB 
entry that is written is determined by a round-robin mechanism; the mechanism writes to entry #0 as the 
first entry after chip reset. 


ITB_TAG<47:32>   VA<47:32>
ITB_TAG<31:13>   VA<31:13>





5.2.2 ITB_PTE 


ITB_PTE is a write-only register through which the ITB PTE array is written. A write to the ITB_PTE 
array, when retired, results in both the ITB_TAG and ITB_PTE arrays being written. The specific entry 
that is written is chosen by the round-robin mechanism described above. 
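
The staging behaviour of ITB_TAG and ITB_PTE can be sketched as follows. This is a model for illustration only: the class and method names are hypothetical, and the entry count passed to the constructor is a placeholder rather than the size of the EV6 ITB.

```python
class ITB:
    """Toy model of ITB fill through the ITB_TAG and ITB_PTE registers."""

    def __init__(self, entries: int = 4) -> None:   # entry count is a placeholder
        self.entries = [None] * entries
        self.next_entry = 0       # entry #0 is written first after chip reset
        self.tag_reg = None       # register outside the ITB array
        self.pte_reg = None

    def write_itb_tag(self, tag: int) -> None:
        self.tag_reg = tag        # staged only; the array is not touched

    def write_itb_pte(self, pte: int) -> None:
        self.pte_reg = pte        # staged until the write retires

    def retire_itb_pte(self) -> int:
        """Retire of the ITB_PTE write: commit tag and PTE, return the entry used."""
        written = self.next_entry
        self.entries[written] = (self.tag_reg, self.pte_reg)
        self.next_entry = (written + 1) % len(self.entries)   # round-robin
        return written
```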


[Figure: ITB_PTE register layout. PFN<31:13> occupies bits <31:13> and PFN<43:32> the upper longword;
the labels of the low-order fields are not legible.]


5.2.3 ITB_IAP


ITB_IAP is a pseudo register which, when written to, invalidates all ITB entries and Icache blocks whose 
ASM bit is clear. The Icache flush will not occur until after the retire of the next encountered 
HW_RET/stall. 


5.2.4 ITB_IA


ITB_IA is a pseudo register which, when written to, invalidates all ITB entries and invalidates the entire 
Icache. The Icache flush will not occur until after the retire of the next encountered HW_RET/stall. 


5.2.5 ITB_IS 


I-stream Translation Buffer Invalidate Single (ITB_IS) is a write-only register. Writing a virtual page
number to this register invalidates any ITB entry which meets one of the following criteria:

• the ITB entry’s virtual page number matches ITB_IS<47:13> (or fewer bits if granularity hint bits are
  set in the ITB entry) and its ASN field matches the address space number supplied in the process
  context IPR: PCTX<46:39>.

• the ITB entry’s virtual page number matches ITB_IS<47:13> and its ASM bit is set.

Note that since the Icache is virtually indexed and tagged, it is normally not necessary to flush the Icache
when paging. Therefore a write to ITB_IS will not flush the Icache.
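
The two match criteria can be illustrated with a short model. The entry type and function name are hypothetical, and granularity hints are ignored for simplicity, so every entry is compared on the full VA<47:13> page number.

```python
from dataclasses import dataclass


@dataclass
class ITBEntry:
    vpn: int            # virtual page number, VA<47:13>
    asn: int            # address space number
    asm: bool           # address space match bit
    valid: bool = True


def itb_invalidate_single(entries, itb_is_value: int, current_asn: int) -> int:
    """Invalidate matching entries; returns how many were invalidated.

    `current_asn` stands for the ASN held in the process context IPR.
    Granularity hint handling is omitted in this sketch.
    """
    vpn = (itb_is_value >> 13) & ((1 << 35) - 1)   # ITB_IS<47:13>
    count = 0
    for entry in entries:
        if entry.valid and entry.vpn == vpn and (entry.asm or entry.asn == current_asn):
            entry.valid = False
            count += 1
    return count
```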


5.2.6 EXC_ADDR 


The Exception Address (EXC_ADDR) register is a read-only register that is updated by hardware when it
encounters an exception or interrupt. If the exception was a fault, EXC_ADDR contains the PC of the 
instruction which triggered the fault. If the exception was a synchronous trap, EXC_ADDR contains the 
PC of the instruction after that which triggered the trap. For an interrupt, EXC_ADDR contains the PC of 
the next instruction which would have executed if the interrupt had not occurred. 


EXC_ADDR<0> is set if the associated exception occurred in PAL mode.


EXC_ADDR<63:32>   PC<63:32>
EXC_ADDR<31:2>    PC<31:2>





5.2.7 IVA_FORM


IVA_FORM is a read-only register containing the virtual page table entry address derived from the 
faulting virtual address stored in the EXC_ADDR register, and from the virtual page table base, VA_48 
and VA_FORM_32 bits stored in the I_CTL register. The IVA_FORM bit format is identical to 
VA_FORM. See section 5.1.4.


5.2.8 IER_CM


IER_CM is a register which contains the interrupt enable (all active fields of the register except
CM<1:0>) and current processor mode (CM<1:0>) bit fields. These two bit fields may be written either
individually or together with a single HW_MTPR instruction. When bits <7:2> of the IPR index field of a
HW_MTPR instruction contain the value 000010₂, this register is selected. Bits <1:0> of the IPR index
indicate which bit fields are to be written: bit<1> corresponds to the IER field, bit<0> corresponds to the
processor mode field. A HW_MFPR of this register returns the values in both fields.
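
The index decoding can be illustrated as follows. The function name and the use of the strings "IER" and "CM" as field labels are choices made for this sketch only.

```python
def decode_ier_cm_index(index: int):
    """Return the set of IER_CM fields selected by an 8-bit IPR index.

    Returns None if the index does not select the IER_CM register.
    """
    if (index >> 2) & 0x3F != 0b000010:   # index<7:2> must be 000010
        return None
    fields = set()
    if index & 0b10:                      # index<1>: IER field
        fields.add("IER")
    if index & 0b01:                      # index<0>: processor mode field
        fields.add("CM")
    return fields
```

For example, index 0000 1011 selects both fields, which matches the CM (0000 10x1) and IER (0000 101x) patterns in the IPR table at the start of this chapter.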


Bits       Field

<31>       CREN
<30:29>    PCEN<1:0>
<28:14>    SIEN<15:1>
<13>       ASTEN
<12:5>     IGN/RAZ
<4:3>      CM<1:0>
<2:0>      IGN/RAZ


Name          Type   Description

CM<1:0>       RW     Current Mode:
                         00   Kernel
                         01   Executive
                         10   Supervisor
                         11   User
ASTEN         RW     AST Interrupt Enable. When set enables those AST interrupt requests
                     which are also enabled by the value in ASTER.
SIEN<15:1>    RW     Software Interrupt Enables
PCEN<1:0>     RW     Performance Counter Interrupt Enables
CREN          RW     Corrected Read Error Interrupt Enable
SLEN          RW     Serial Line Interrupt Enable
EIEN<5:0>     RW     External Interrupt Enable


5.2.9 SIRR


The Software Interrupt Request Register (SIRR) is a read/write register containing bits to request software 
interrupts. In order to generate a particular software interrupt, its corresponding bits in SIRR and 
IER<SIER> must both be set. 


SIRR<63:32>   IGN/RAZ
SIRR<31:29>   IGN/RAZ
SIRR<28:14>   SIR<15:1>
SIRR<13:0>    IGN/RAZ


5.2.10 ISUM


The Interrupt Summary (ISUM) register is a read-only register that records all pending hardware, 
software and AST interrupt requests. 


Bits       Field

<31>       CR
<30:29>    PC<1:0>
<28:14>    SI<15:1>
<13:11>    RAZ
<10>       ASTU
<9>        ASTS
<8:5>      RAZ
<4>        ASTE
<3>        ASTK
<2:0>      RAZ

The SL and EI<5:0> fields are held in the upper longword.


Name        Type   Description

ASTx        R      AST Interrupts. For each processor mode, records whether an associated
                   AST interrupt is pending. This includes the mode’s ASTER and ASTRR
                   bits, and whether the processor mode value held in the CM register is
                   greater than or equal to the value for the mode.
SI<15:1>    R      Software Interrupts
PC<1:0>     R      Performance Counter Interrupts
CR          R      Corrected Read Error Interrupts
SL          R      Serial Line Interrupt
EI<5:0>     R      External Interrupts
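
The ASTx pending condition can be sketched as a function of ASTER, ASTRR and the current mode. The function name is hypothetical, and the packing of ASTER and ASTRR into 4-bit masks indexed by mode number (bit 0 for kernel) is a convenience of this example rather than the register layout.

```python
MODES = ("kernel", "executive", "supervisor", "user")   # CM encodings 0 to 3


def ast_pending(aster: int, astrr: int, cm: int) -> dict:
    """Return, per processor mode, whether an AST interrupt is pending.

    `aster` and `astrr` are 4-bit masks indexed by mode number (hypothetical
    packing); `cm` is the current mode value held in CM<1:0>.
    """
    pending = {}
    for mode_value, name in enumerate(MODES):
        enabled = (aster >> mode_value) & 1
        requested = (astrr >> mode_value) & 1
        pending[name] = bool(enabled and requested) and cm >= mode_value
    return pending
```

With every mode enabled, requests outstanding for executive and user, and the processor in executive mode, only the executive AST is reported as pending.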


5.2.11 HW_INT_CLR
HW_INT_CLR is a write-only register used to clear edge-sensitive interrupt requests.


Name       Type   Description

PC<1:0>    W1C    Clears performance counter interrupt requests
CR         W1C    Clears corrected read error interrupt request
SL         W1C    Clears serial line interrupt request


5.2.12 EXC_SUM


The Exception Summary (EXC_SUM) register is a read-only register which records information about 
instructions which triggered traps. The register is updated at trap delivery time; its contents are only valid 
if it is read (via a HW_MFPR) in the first fetch block of the exception handler. There are three types of 
traps for which this register captures related information: 

• Arithmetic traps: the instruction generated an exceptional condition which should be reported to the
  operating system, and/or the FPCR status bit associated with this condition is clear and should be set
  by PALcode. Additionally, the REG field contains the register number of the destination specifier for
  the instruction which triggered the trap.

• I-stream ACV: The BAD_IVA bit of this register indicates whether the offending I-stream virtual
  address is latched into the EXC_ADDR or VA register.

• D-stream Exceptions: The REG field contains the register number of either the source specifier (for
  stores) or the destination specifier (for loads) of the instruction which triggered the trap.


Name        Type   Description

SWC         R      Indicates software completion possible. This bit is set if the instruction
                   which triggered the trap contained the /S modifier.
INV         R      Indicates invalid operation trap
DZE         R      Indicates divide by zero trap
FOV         R      Indicates floating point overflow trap
UNF         R      Indicates floating point underflow trap
INE         R      Indicates floating point inexact error trap
IOV         R      Indicates Fbox convert to integer overflow or Ebox integer overflow
                   trap
INT         R      Set to indicate Ebox integer overflow trap, clear to indicate Fbox trap
                   condition
REG         R      Destination register of load or operate which triggered the trap OR
                   source register of store which triggered the trap. These bits may contain
                   the Rc field of an operate instruction or the Ra field of a load or store
                   instruction. The value is unpredictable if the trap was triggered by an
                   ITB miss, interrupt, OPCDEC, or other non-load/store/operate instruction.
BAD_IVA     R      Bad I-stream VA. This bit should be used by the IACV PALcode
                   routine to determine whether the offending I-stream virtual address is
                   latched in the EXC_ADDR register or the VA register. If BAD_IVA is
                   clear, then EXC_ADDR contains the address; if BAD_IVA is set then
                   VA contains the address.
RSVD/IGN    R,0    Reserved for hardware use.
SET_INV     R      PALcode should set FPCR<INV>
SET_DZE     R      PALcode should set FPCR<DZE>
SET_OVF     R      PALcode should set FPCR<OVF>
SET_UNF     R      PALcode should set FPCR<UNF>
SET_INE     R      PALcode should set FPCR<INE>
SET_IOV     R      PALcode should set FPCR<IOV>


5.2.13 PAL_BASE 


PAL_BASE is a read/write register which contains the base physical address for PALcode. Its contents are 
cleared by chip reset. 


Bits <31:15>   PAL_BASE<31:15>


5.2.14 I_CTL


Ibox Control (I_CTL) is a read/write register which controls various Ibox functions. Its contents are
cleared by chip reset.












[Figure: I_CTL register layout. Fields, from bit <0> upward: SPCE, IC_EN<1:0>, SPE<2:0>, SDE<1:0>,
SBE<1:0>, BP_MODE<1:0>, HWE, VA_48, VA_FORM_32, SINGLE_ISSUE_L, PCT0_EN, PCT1_EN, CALL_PAL_R23,
MCHK_EN, TB_MB_EN, CHIP_ID, VPTB<31:30>. Upper longword: VPTB<47:32>, SEXT(VPTB<47>).]


Name              Type   Description

SPCE              RW,0   System Performance counter enable. A performance counter is
                         enabled if its individual enable is asserted (PCTR0 or PCTR1) and
                         either SPCE or the PPCE bit of the Ibox process context IPR is set.

IC_ENABLE<1:0>    RW,3   Icache set enable. The entire cache may be enabled by setting both
                         bits. Zero, one, or two Icache sets can be enabled.

SPE<2:0>          RW,0   Super page mode enables - just like the SPE bits in the Mbox
                         M_CTL IPR.

SDE<1:0>          RW,0   When set, enables access to the PAL shadow registers.
                         If SDE<0> is set, R8-R11 & R24-R27 are used as PAL shadows.
                         If SDE<1> is set, R4-R7 & R20-R23 are used as PAL shadows.
                         Both SDE<0> and SDE<1> may be set. However, this reduces the
                         size of the physical integer register free pool, and may reduce
                         overall system performance.

SBE<1:0>          RW,0   Stream Buffer Enable. The value in this bit field specifies the
                         number of stream buffer prefetches (besides the demand-fill) which
                         are launched after an Icache miss. If the value is zero, only demand
                         requests are launched.

BP_MODE<1:0>      RW,0   Branch prediction mode selection:
                         BP_MODE<1>: If set, forces all branches to be predicted
                         fall-thru. If clear, the dynamic branch predictor is chosen.
                         BP_MODE<0>: If set, the dynamic branch predictor
                         chooses local history prediction. If clear, the dynamic
                         branch predictor chooses local or global prediction based
                         on the state of the chooser.

HWE               RW,0   If set, allow PALRES instructions to be executed in kernel mode.
                         Note that modification of the ITB while in kernel mode/native mode
                         may cause unpredictable behavior.

...               RW,0   When set, forces bad Icache tag parity on fills.

FBDP              RW,0   When set, forces bad Icache data parity on fills.

VA_48             RW,0   This bit controls the format applied to effective virtual addresses by
                         the IVA_FORM register and the Ibox virtual address sign extension
                         checkers. When VA_48 is clear, 43-bit virtual address format is
                         used, and when VA_48 is set, 48-bit virtual address format is used.
                         The effect of this bit on the IVA_FORM register is identical to the
                         effect of VA_CTL<VA_48> on the VA_FORM register. See section
                         5.1.4.

                         When VA_48 is set the sign extension checkers generate an ACV
                         if:

                             va<63:0> != SEXT(va<47:0>)

                         When VA_48 is clear the sign extension checkers generate an
                         ACV if:

                             va<63:0> != SEXT(va<42:0>)

                         This bit also affects three additional functions:
                         (1) JSR return address: when set, the address is sign-extended from
                         bit 47. Otherwise it is sign-extended from bit 43.
                         (2) PC adder: when set, the PC incrementer generates addresses in a
                         48-bit virtual address space instead of a 44-bit virtual address space.
                         (3) DTB_DOUBLE traps: if set, the DTB double miss traps vector
                         to the DTBM_DOUBLE_4 entry point.

VA_FORM_32        RW,0   This bit controls address formatting on a read of the IVA_FORM
                         register. See section 5.1.4.

SINGLE_ISSUE_L    RW,0   When clear, this bit forces instructions to issue only from the
                         bottom-most entries of the IQ and FQ.

PCT0_EN           RW,0   Enable performance counter #0. If this bit is one, the performance
                         counter will count if EITHER the system (SPCE) or process (PPCE)
                         performance counter enable is set.

PCT1_EN           RW,0   Enable performance counter #1. If this bit is one, the performance
                         counter will count if EITHER the system (SPCE) or process (PPCE)
                         performance counter enable is set.

CALL_PAL_R23      RW,0   When set, the CALL_PAL linkage register is R23; when clear it is
                         R27. This choice should correspond to SDE so as to ensure that a
                         shadow register is used as the linkage register.

MCHK_EN           RW,0   Machine check enable - set to enable machine checks.

TB_MB_EN          RW,0   When set, the hardware ensures that the virtual-mode loads in DTB
                         and ITB fill flows which access the page table and the subsequent
                         virtual mode load or store which is being retried are ‘ordered’
                         relative to another processor’s stores. This must be set for
                         multiprocessor systems in which no MB instruction is present in the
                         TB fill flow.

CHIP_ID<5:0>      R      This is a read-only field which supplies the revision ID number for
                         the EV6 part. EV6 pass 1 parts will have a chip ID of 000001.

VPTB<63:30>       RW,0   Virtual Page Table Base. See section 5.1.4 for details.
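
Two of the I_CTL rules above lend themselves to a quick software check: the choice of PAL shadow registers under SDE<1:0> together with the CALL_PAL linkage register, and the performance counter enable condition. The sketch below is illustrative and all function names are hypothetical.

```python
def pal_shadow_registers(sde: int) -> list:
    """Integer register numbers that act as PAL shadows for a given SDE<1:0>."""
    regs = []
    if sde & 0b01:                                  # SDE<0>: R8-R11 and R24-R27
        regs += [8, 9, 10, 11, 24, 25, 26, 27]
    if sde & 0b10:                                  # SDE<1>: R4-R7 and R20-R23
        regs += [4, 5, 6, 7, 20, 21, 22, 23]
    return sorted(regs)


def linkage_register_is_shadow(sde: int, call_pal_r23: bool) -> bool:
    """True if the CALL_PAL linkage register (R23 or R27) is a PAL shadow."""
    linkage = 23 if call_pal_r23 else 27
    return linkage in pal_shadow_registers(sde)


def counter_enabled(pct_en: bool, spce: bool, ppce: bool) -> bool:
    """A counter counts if its own enable is set and SPCE or PPCE is set."""
    return bool(pct_en and (spce or ppce))
```

A configuration in which `linkage_register_is_shadow` returns False is one where CALL_PAL_R23 does not correspond to SDE.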


5.2.15 I_STAT
Ibox Status (I_STAT) is a read/write register which contains Ibox status information.


Name   Type    Description

TPE    R,W1C   If set, an Icache tag parity error occurred.
DPE    R,W1C   If set, an Icache data parity error occurred.





5.2.16 IC_FLUSH 


IC_FLUSH is a pseudo register which, when written, results in all Icache blocks being invalidated. The 
cache is actually flushed at the retire of the next encountered HW_RET/STALL instruction. 


5.2.17 CLR_MAP 


CLR_MAP is a pseudo register which, when written, results in the clearing of the current map of virtual
to physical registers. This register must only be written after there are no register-borne dependencies 
present and there are no unretired instructions. See PALcode restrictions for a usage example. 


5.2.18 SLEEP 


SLEEP is a pseudo register which, when written, results in the PLL speed being reduced and the chip
entering a low-power mode. This register must only be written after a sequence of code has been run
which saves all necessary state to DRAM, flushes the caches, and unmasks certain interrupts so the chip
can be woken up. The details of this sequence are TBD.
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5.2.20 Ibox Process Context IPR (PCTX) 


This register contains information associated with the context of a process. Any combination of the bit
fields within this register may be written with a single HW_MTPR instruction. When bits <7:6> of the
IPR index field of a HW_MTPR instruction contain the value 01 (binary), this register is selected. Bits <4:0> of
the IPR index indicate which bit fields are to be written. The correspondence between register fields and
IPR index bits is:





| + IGN/RAZ 
| PPCE 
FPE 








Name    T    Description
ASN     RW   Address Space Number.
ASTER   RW   AST Enable Register - used to individually enable each of the four AST
             interrupt requests. The bit order within this field is:
               User Mode        <11>
               Supervisor Mode  <10>
               Executive Mode   <9>
               Kernel Mode      <8>
ASTRR   RW   AST Request Register - used to request AST interrupts in each of the
             four processor modes. In order to generate a particular AST interrupt, its
             corresponding bits in ASTRR and ASTER must be set, along with the
             ASTE bit in IER. Further, the value of the current mode bits in the PS
             register must be equal to or higher than the value of the mode associated
             with the AST request. The bit order within this field is:
               User Mode        <11>
               Supervisor Mode  <10>
               Executive Mode   <9>
               Kernel Mode      <8>
PPCE    RW   Process Performance Counter Enable. Both performance counters are
             enabled if either this bit is set or the SPCE bit of the I_CTL register is
             set.
FPE RW,0 Floating Point Enable - if clear, floating point instructions generate FEN 
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5.2.21 PCTR_CTL 


Performance counter control (PCTR_CTL) is a read/write register which controls the function of the 
performance counters. 





31 28 27 26 25 6 5 4 3 0
PCTR1<19:0>
SL1<3:0>
SL0<0>
IGN/RAZ
IGN/RAZ
PCTR0<3:0>
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PCTRO0<19:4> 







Name    T    Description

SL1     RW   Select Input for Performance Counter #1
               0000: Retired Instructions
               0001: Retired Conditional Branches
               0010: Retired Branch Mispredicts
               0011: Retired ITB Misses
               0100: Retired DTB Misses
               0101: Retired Unaligned Traps
               0110: Icache Misses
               0111: MBOX Replay Traps
               1000: Dcache Load Misses
               1001: Dcache Misses
               1010: Bcache Reads
               1011: Bcache Writes
               1100: SysPort Reads
               1101: SysPort Writes
               1110:
               1111:

SL0     RW   Select Input for Performance Counter #0
               0: Cycles
               1: Retired instructions

PCTR1   RW   Performance Counter #1

PCTR0   RW   Performance Counter #0
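
The select encodings above can be captured in a small lookup table. The following is a minimal sketch, not part of the specification: the function name describe_selection is hypothetical, SL0 is assumed to select the input for counter #0, and the two SL1 codes for which the table lists no event (1110 and 1111) are deliberately left undefined.

```python
# Event names copied from the SL1 field description. Codes 1110 and 1111
# have no event listed in the text, so they are not given a meaning here.
SL1_EVENTS = {
    0b0000: "Retired Instructions",
    0b0001: "Retired Conditional Branches",
    0b0010: "Retired Branch Mispredicts",
    0b0011: "Retired ITB Misses",
    0b0100: "Retired DTB Misses",
    0b0101: "Retired Unaligned Traps",
    0b0110: "Icache Misses",
    0b0111: "MBOX Replay Traps",
    0b1000: "Dcache Load Misses",
    0b1001: "Dcache Misses",
    0b1010: "Bcache Reads",
    0b1011: "Bcache Writes",
    0b1100: "SysPort Reads",
    0b1101: "SysPort Writes",
}

# SL0 is a single bit (assumed here to select the input of counter #0).
SL0_EVENTS = {0: "Cycles", 1: "Retired instructions"}


def describe_selection(sl0, sl1):
    """Return (counter #0 event, counter #1 event) for the given select values."""
    if sl0 not in SL0_EVENTS:
        raise ValueError("SL0 is a 1-bit field")
    if not 0 <= sl1 <= 0b1111:
        raise ValueError("SL1 is a 4-bit field")
    return SL0_EVENTS[sl0], SL1_EVENTS.get(sl1, "(not listed)")
```

For example, describe_selection(0, 0b0110) pairs a cycle count in counter #0 with Icache misses in counter #1, which is the usual shape of a miss-rate measurement.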
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5.3 Mbox IPRs 
This section describes the IPRs which control Mbox functions. 


5.3.1 DTB_TAG0 & DTB_TAG1

DTB_TAG0 and DTB_TAG1 are write-only registers through which the two memory pipe DTB tag
arrays are written. Writes to DTB_TAG0 and DTB_TAG1 actually write registers outside the DTB
arrays. When writes to the corresponding DTB_PTE registers are retired, the contents of both the
DTB_TAG and DTB_PTE registers are written into their respective DTB arrays at locations determined
by the round-robin allocation algorithm.


31 13 12 0 
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VA<31:13> 
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VA<47:32> 
5.3.2 DTB_PTE0 & DTB_PTE1


DTB_PTE0 and DTB_PTE1 are registers through which the DTB PTE arrays are written. The entries to be
written are chosen by a round-robin allocation scheme. Writes to the DTB_PTE registers, when retired,
result in both the DTB_TAG and DTB_PTE arrays being written.


1615141312111098 76543 21 0 


63 62 32 
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PA<43:13> 
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IGN 


5.3.3 DTB_ALTMODE 


DTB_ALTMODE is a write-only register whose contents specify the alternate processor mode used by some
HW_LD and HW_ST instructions.
31                          2 1 0
                              ALT_MODE<1:0>


Name            Type   Description
ALT_MODE<1:0>   RW     Alt_Mode:
                         00  Kernel
                         01  Executive
                         10  Supervisor
                         11  User


5.3.4 DTB_IAP 


D-stream Translation Buffer Invalidate All Process (DTB_IAP) is a write-only pseudo register. Writes to 
this register invalidate all DTB entries in which the address space match (ASM) bit is clear. 


5.3.5 DTB_IA 


D-stream Translation Buffer Invalidate All (DTB_IA) is a write-only pseudo register. Writes to this 
register invalidate all DTB entries and reset the DTB not-last-used pointer to its initial state. 
5.3.6 DTB_IS0 & DTB_IS1


The D-stream Translation Buffer Invalidate Single registers (DTB_IS0 & DTB_IS1) are write-only
pseudo registers through which software may invalidate a single entry in the DTB arrays. Writing a
virtual page number to one of these registers invalidates any DTB entry in the corresponding memory
pipeline which meets one of the following criteria:

• the DTB entry’s virtual page number matches DTB_IS<47:13> and its ASN field matches
  DTB_ASN<63:56>

• the DTB entry’s virtual page number matches DTB_IS<47:13> and its ASM bit is set
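
The two invalidation criteria reduce to a single predicate per entry. The sketch below models that predicate in software purely for illustration; the DtbEntry class, its field names, and the function names are hypothetical and say nothing about how the hardware arrays are organized.

```python
from dataclasses import dataclass


@dataclass
class DtbEntry:
    """Illustrative software model of one DTB entry (names are hypothetical)."""
    vpn: int            # virtual page number, VA<47:13>
    asn: int            # address space number of the entry
    asm: bool           # address space match bit
    valid: bool = True


def vpn_of(va):
    """Extract VA<47:13> (35 bits) from a virtual address."""
    return (va >> 13) & ((1 << 35) - 1)


def invalidate_single(entries, va, current_asn):
    """Invalidate every entry matching the DTB_IS criteria; return the count.

    An entry is invalidated when its page number matches and either its
    ASM bit is set or its ASN equals the current DTB_ASN value.
    """
    target = vpn_of(va)
    count = 0
    for entry in entries:
        if entry.valid and entry.vpn == target and (entry.asm or entry.asn == current_asn):
            entry.valid = False
            count += 1
    return count
```

Note that an entry with the same page number but a different ASN and a clear ASM bit survives the write, which is what allows other processes' translations to stay resident.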


31 13 12 0
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VA<31:13> 
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VA<47:32> 





5.3.7 DTB_ASN0 & DTB_ASN1


The D-stream Translation Buffer Address Space Number registers (DTB_ASN0 & DTB_ASN1) are
write-only registers which should be written with the address space number of the current process.


31 0 
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ASN<7:0> 
5.3.8 MM_STAT


MM_STAT is a read-only register. When a D-stream TB miss or fault occurs, information about the error
is latched in the MM_STAT register. MM_STAT is not locked by a LD_VPTE instruction.


31 
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| L.» WR 
| | ACV 
FOR 


FOW 
DC_TAG_PERR 


63 32 








Name          Description

WR            Set if the reference which triggered the error was a write.
ACV           Set if the reference caused an access violation. Includes bad virtual address.
FOR           Set if the reference was a read operation and the PTE FOR bit was set.
FOW           Set if the reference was a write operation and the PTE FOW bit was set.
OPCODE        Opcode of the instruction which triggered the error.
DC_TAG_PERR   Set to indicate that a Dcache tag parity error occurred during the initial tag probe of a
              load or store instruction. This error created a synchronous fault to the D_FAULT
              PALcode entry point, and is correctable. The virtual address associated with the error
              is available in the VA register.





Note: 


The Ra field of the instruction which triggered the error can be obtained from the Ibox 
EXC_SUM register. 
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5.3.9 M_CTL 
Mbox Control (M_CTL) is a write-only register, the contents of which are cleared by chip reset. 


31 4 3 1 0
SPE<2:0> WO,0 Super Page mode enables. Only one (or none) may be set.





SPE<2>, when set, enables super page mapping when
VA<47:46> = 2. In this mode VA<43:13> are mapped directly to
PA<43:13> and VA<45:44> are ignored.

SPE<1>, when set, enables super page mapping when
VA<47:41> = 7E (hex). In this mode VA<40:13> are mapped directly to
PA<40:13> and PA<43:41> are copies of PA<40> (sign extension).

SPE<0>, when set, enables super page mapping when
VA<47:30> = 3FFFE (hex). In this mode VA<29:13> are mapped
directly to PA<29:13> and PA<43:30> are cleared.

Note: Super page accesses are only allowed in kernel mode. Non-kernel
mode references to super pages result in access violations.
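
The three super page modes can be checked against concrete addresses with a short model. This is a sketch under stated assumptions rather than a description of the hardware: the function names are hypothetical, the byte-within-page bits VA<12:0> are assumed to pass through unchanged to PA<12:0>, and an access violation is modelled as a Python PermissionError.

```python
def bits(value, hi, lo):
    """Return value<hi:lo> as an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def superpage_translate(va, spe, kernel_mode=True):
    """Map va through an enabled super page region.

    spe is the 3-bit SPE field (at most one bit set). Returns the physical
    address, or None when va does not fall in an enabled super page region.
    Raises PermissionError for a non-kernel reference to a super page.
    """
    page = None
    if spe & 0b100 and bits(va, 47, 46) == 2:
        # VA<43:13> -> PA<43:13>; VA<45:44> are ignored.
        page = bits(va, 43, 13) << 13
    elif spe & 0b010 and bits(va, 47, 41) == 0x7E:
        # VA<40:13> -> PA<40:13>; PA<43:41> copy PA<40> (sign extension).
        page = bits(va, 40, 13) << 13
        if bits(va, 40, 40):
            page |= 0b111 << 41
    elif spe & 0b001 and bits(va, 47, 30) == 0x3FFFE:
        # VA<29:13> -> PA<29:13>; PA<43:30> are cleared.
        page = bits(va, 29, 13) << 13
    if page is None:
        return None
    if not kernel_mode:
        raise PermissionError("access violation: super page reference outside kernel mode")
    # Assumption: the offset within the page is untranslated.
    return page | bits(va, 12, 0)
```

Under these assumptions, a kernel reference to (0x7E << 41) | (1 << 40) | 0x2000 with SPE<1> set yields 0xF0000002000, showing the sign extension of PA<40> into PA<43:41>.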
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5.3.10 DC_CTL 


Dcache Control (DC_CTL) is a write-only register that controls Dcache activity. The contents of
DC_CTL are initialized by chip reset as indicated.


31 


876543210 





L—» SET_EN<1:0> 
F_HIT 
FLUSH 


F_BAD_TPAR 
F_BAD_DECC 
DCTAG_PAR_EN 
DCDAT_ERR_EN 





Name           T     Description

SET_EN<1:0>    W,3   Dcache Set Enable. At least one set must be enabled.

F_HIT          W,0   Force Hit. When set, this bit causes all memory space load and store
                     instructions to hit in the Dcache, independent of the tag status bits.
                     In this mode, only one of the two sets may be enabled, and tag parity
                     checking must be disabled (set DCTAG_PAR_EN to zero).

FLUSH          W,0   When the value written into the DC_CTL register contains a one in
                     this bit position, all the Dcache tag valid bits are cleared.

F_BAD_TPAR     W,0   Force Bad Tag Parity. If set, this bit causes bad tag parity to be put
                     into the Dcache tag array during Dcache fill operations.

F_BAD_DECC     W,0   Force Bad Data ECC. If set, this bit causes ECC data NOT to be written into
                     the cache along with the block that is loaded by a fill or store. This
                     can be used to cause bad ECC to be present in the Dcache by writing
                     the same block with different data than is already present. Since the
                     old ECC value will remain, it will be ‘bad’ relative to the new data.

DCTAG_PAR_EN   W,0   Dcache tag parity enable.

DCDAT_ERR_EN   W,0   Dcache data ECC and parity error enable.
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5.3.11 DC_STAT 


Dcache Status (DC_STAT) is a read/write register. If a Dcache tag parity error or data ECC error occurs,
information about the error is latched in DC_STAT.


31 5 4 3 2 1 0

TPERR_P0
TPERR_P1
ECC_ERR_ST
ECC_ERR_LD
SEO





Name         T       Description

TPERR_P0     R,W1C   Tag Parity Error - Pipe 0. When set, this bit indicates that a Dcache
                     tag probe from pipe 0 resulted in a tag parity error. The error is
                     uncorrectable and will result in a machine check.

TPERR_P1     R,W1C   Tag Parity Error - Pipe 1. When set, this bit indicates that a Dcache
                     tag probe from pipe 1 resulted in a tag parity error. The error is
                     uncorrectable and will result in a machine check.

ECC_ERR_ST   R,W1C   When set, this bit indicates that an ECC error occurred while
                     processing a store.

ECC_ERR_LD   R,W1C   When set, this bit indicates that an ECC error occurred while
                     processing a load (load data retrieved from Dcache or Bcache fill
                     data).

SEO          R,W1C   Second Error Occurred. When set, this bit indicates that a tag parity
                     or Store data ECC error occurred while the DC_STAT register was
                     already locked; or that a Load data ECC error occurred while the
                     DC_STAT register was already locked and error recovery was in
                     progress.

5.4 Cbox CSRs and IPRs


The CBOX Control/Status Registers (CSR’s) are write-only registers which define system configuration, 
command processing, and timing parameters. They are written via a serial load/shift register. Six bits of 
data are written to CBOX_DATA IPR via a HW_MTPR instruction. When the instruction retires, the 
data in the register is shifted into the CBOX. The process is repeated until all CBOX data is shifted in. 


The CBOX Internal Processor Registers (IPR’s) are read-only registers which allow software access to 
system error information. They are read via a serial shift/read register. A SHIFT command is written to 
CBOX_SHIFT IPR. When the instruction retires, the CBOX_DATA IPR contains the first six bits of 
error information and can be read via a HW_MFPR instruction. The process is repeated until all CBOX 
data is shifted in. 
C_DATA<5:0> RW Cbox data register. Writes 6 bits of CSR data into serial shift
register. When read (after C_SHIFT), allows access to 6 sequential
bits of CBOX IPR data.
C_SHIFT<0>

Name Type Description
C_SHIFT<0> W1 When written (with a ‘1’), causes 6 bits of CBOX IPR data to shift
into the CBOX_DATA register, where it can be read by software (via a
HW_MFPR instruction). All bits of the CBOX IPR data scan chain
must be shifted.


5.4.1 CBOX CSR Description 


Note that the precise order of these CSR’s is TBD. 





Name                  Description

FRAME_SEL<2:0>        Sets ratio of framing clock to bit time. This in turn specifies the
                      number of samples per framing clock. Allowed values:
                        0001: 1 (one sample per framing clock)
                        0010: 2
                        0100: 4

VICTIM_THRESH<7:0>    Dcache victim threshold. Number of Dcache read victims to allow
                      to accumulate in VAF before writing them to the Bcache. The number of
                      victims is specified by a set bit in the vector. Allowed values:
                        00000001: 1 (write Bcache on presence of one victim)
                        00000010: 2 (write Bcache on presence of two victims)
                        00000100: 3
                        00001000: 4
                        00010000: 5
                        00100000: 6
                        01000000: 7
                        10000000: 8

BC_RDVICTIM<0>        When set, causes EV6 to abut victim writes with reads. Used for
                      systems in which victim data and read data are on the same DRAM
                      page.

SYSCLK_RATIO<15:0>    Ratio of CPU clock to SYSCLK. The final multiple of CPU clock
                      period to SYSCLK period is calculated as
                        1 + (0.5 * SYSCLK_RATIO)
                      Allowed values are:
                        0000000000000001: final multiple is 1.5
                        0000000000000010: final multiple is 2.0
                        0000000000000100: final multiple is 2.5
                        0000000000001000: final multiple is 3.0
                        0000000000010000: final multiple is 3.5
                        0000000000100000: final multiple is 4.0

DUP_TAG_ENA<0>        When set, indicates to EV6 that external system has a duplicate tag.

SET_DIRTY_ENA<2:0>    Enable sending set-dirty commands to system. Protocols
                      supported:
                        000: EV6 sends no dirty commands off chip
                        001: EV6 sends clean-to-dirty commands
                        010: EV6 sends shared/clean
                        011: EV6 sends clean commands
                        100: EV6 sends shared/dirty commands
                        101: EV6 sends shared/dirty AND clean commands
                        110: EV6 sends all shared commands
                        111: EV6 sends all commands off chip

ZEROBLK_ENA<1:0>      Enable zero block processing and commands:
                        ZEROBLK_ENA<1>: Enables zeroblk commands to
                        system (Multiprocessor systems or duplicate tagged
                        systems need to see zeroblk commands)
                        ZEROBLK_ENA<0>: If set, enables EV6 processing
                        zeroblk command as zero-block. If clear, EV6 converts
                        zeroblk commands to read-modified commands.

SPEC_READ_ENA<0>      Enable speculative reads (read commands sent to system before
                      Bcache hit is known).

SYSBUS_FORMAT<0>      Format of physical address as it appears on system bus. Two
                      allowed configurations:
                        0: Interleaved on Bcache block boundaries
                        1: Page mode-hit
                      (refer to chapter 6)

SYSBUS_MB_ENA<0>          When set, sends memory barrier commands (MB) to system.

SYSBUS_ACK_LIMIT<4:0>     Encoded count of maximum number of outstanding commands the
                          system can accept. Values are interpreted as:
                            00000: INF (system can accept an infinite number of
                                   outstanding commands)
                            00001: 1 (system can accept 1 outstanding command)
                            00010: 2 (system can accept 2 outstanding commands)
                            ...
                            10110: 22 (system can accept 22 outstanding commands)

STIO_32_LIMIT<0>          If set, system is imposing a 32-byte limit for stores to IO space.

Bcache Port Control/Status

BC_ENA<0>                 If set, Bcache is enabled.

BC_CLEAN_VICTIM<0>        If set, causes EV6 to notify system that a clean block is being
                          evicted. EV6 sends a CleanVictimBlk command along with the
                          victim address.

BC_SIZE<3:0>              Encoded Bcache size. Allowed values are:
                            0000: 1 MB
                            0001: 2 MB
                            0011: 4 MB
                            0111: 8 MB
                            1111: 16 MB

BC_RD_RD_BBL<1:0>         Number of CPU cycles to insert between reads to different SRAM
                          banks. If the Bcache is built as one bank, the value should be zero.
                          Values are interpreted as:
                            00: 0 (no bubble cycles between reads to different SRAM
                                banks)
                            01: 1 (one bubble cycle)
                            10: 2 (two bubble cycles)
                            11: 3 (three bubble cycles)

BC_RD_CLK_RATIO<15:0>     Ratio of Bcache clock period to CPU clock period. The final
                          multiple is computed as:
                            1 + (0.5 * N)
                          where N is encoded in BC_RD_CLK_RATIO as one of the
                          following allowed values:
                            0000000000000001: final multiple is 1.5
                            0000000000000010: final multiple is 2
                            0000000000000100: final multiple is 2.5
                            0000000000001000: final multiple is 3
                            0000000000010000: final multiple is 3.5
                            0000000000100000: final multiple is 4

BC_RD_WR_BBL<5:0>         Encoded number of Bcache clock cycles between a Bcache read and
                          write. Allowed values are:
                            00000: zero clock cycles
                            ...
                            01111: fifteen clock cycles

BC_LATE_WR_BC<2:0>        For Late Write synchronous SRAMs. The following three IPRs
                          encode the total delay for which Bcache data is delayed from Bcache
                          address. The total delay is calculated as the sum of the specified
                          number of Bcache and CPU Clock cycles plus the CPU clock phase
                          offset.
                          BC_LATE_WR_BC encodes the number of Bcache clock cycles to delay
                          the write data. (000 = 0 cycles; 111 = 7 cycles)

BC_LATE_WR_CPU<1:0>       Encoded number of CPU Clock cycles to delay the write data (see
                          above). (00 = 0 cycles; 11 = 3 cycles)

BC_LATE_WR_PHASE<0>       When set, delays write data by one CPU clock phase (see above).

BC_BURST_MODE_ENA<0>      When set, enables Bcache burst mode.

Internal Cbox CSRs

BC_RDCLK_VECTOR<15:0>     Vector describing Bcache read clocks, 1 bit per phase, 50% duty
                          cycle (111000 for 1.5)
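
Several of the CSRs above use one-hot or thermometer-style encodings rather than plain binary counts. The following sketch decodes three of them as a sanity check on the tables; the function names are hypothetical. For the clock ratios, the value plugged into 1 + (0.5 * N) is taken to be the 1-based position of the set bit, since that is the only reading consistent with the listed allowed values (this is an interpretation, not a statement from the text). The decoder accepts any one-hot 16-bit value even though only the first six are listed as allowed.

```python
def one_hot_index(value, width):
    """Return the 1-based position of the single set bit in a width-bit field."""
    if value <= 0 or value >= (1 << width) or value & (value - 1):
        raise ValueError("field value is not one-hot")
    return value.bit_length()


def clock_multiple(ratio_field):
    """Decode SYSCLK_RATIO or BC_RD_CLK_RATIO: 1 + 0.5 * N, N = set-bit position."""
    return 1 + 0.5 * one_hot_index(ratio_field, 16)


def victim_threshold(field):
    """Decode VICTIM_THRESH<7:0>: the set-bit position is the victim count."""
    return one_hot_index(field, 8)


# BC_SIZE<3:0> uses the five listed encodings only.
BC_SIZE_MB = {0b0000: 1, 0b0001: 2, 0b0011: 4, 0b0111: 8, 0b1111: 16}


def bcache_size_mb(bc_size):
    """Decode BC_SIZE<3:0> into megabytes; raises KeyError for unlisted values."""
    return BC_SIZE_MB[bc_size]
```

Rejecting values with more than one bit set is a deliberate choice here: the text gives no meaning for such values, so the sketch treats them as configuration errors rather than guessing.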





5.4.2 CBOX IPR Description 


The CBOX IPR’s are read 6 bits at a time: ERR_ADDR<43:38> comprises the first read-group; 
ERR_ADDR<43> is read in CBOX_DATA<5>. 





Name                 Description
ERR_ADDR<43:6>       Address of last reported ECC or parity error
ERR_CODE<2:0>        Summary of where error was detected:
                       000: No error
                       001: Bcache tag parity error
                       010: Triplicate tag parity error
                       011: Memory data ECC error
                       100: Bcache data ECC error
                       101: Dcache data ECC error
ECC_SYNDROME<7:0>    Syndrome of last reported ECC error.
RAZ<4:0>             Padded zeros to extend shift chain to a multiple of 6
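
The field widths in this table sum to 54 bits (38 + 3 + 8 + 5), which is nine 6-bit read-groups. The sketch below reassembles the fields from the nine values software would read from CBOX_DATA. It assumes the fields appear in the scan chain in the order listed in the table, most significant bit first; only the position of ERR_ADDR<43:38> in the first read-group is stated by the text, so the rest of the ordering is an assumption, and the function name is hypothetical.

```python
# (name, width) in assumed scan-chain order, most significant field first.
FIELDS = [("ERR_ADDR", 38), ("ERR_CODE", 3), ("ECC_SYNDROME", 8), ("RAZ", 5)]

ERR_CODE_NAMES = {
    0b000: "No error",
    0b001: "Bcache tag parity error",
    0b010: "Triplicate tag parity error",
    0b011: "Memory data ECC error",
    0b100: "Bcache data ECC error",
    0b101: "Dcache data ECC error",
}


def assemble_chain(groups):
    """Rebuild the IPR fields from successive 6-bit CBOX_DATA reads.

    groups[0] is the first read-group, with CBOX_DATA<5> holding the most
    significant bit of the chain (ERR_ADDR<43>).
    """
    total = sum(width for _, width in FIELDS)  # 54 bits
    if len(groups) * 6 != total:
        raise ValueError("expected %d read-groups" % (total // 6))
    chain = 0
    for group in groups:
        if not 0 <= group < 64:
            raise ValueError("each read-group is 6 bits")
        chain = (chain << 6) | group
    fields = {}
    remaining = total
    for name, width in FIELDS:
        remaining -= width
        fields[name] = (chain >> remaining) & ((1 << width) - 1)
    # The ERR_ADDR field holds address bits <43:6>; restore the bit positions.
    fields["ERR_ADDR"] <<= 6
    return fields
```

A handler would typically look up fields["ERR_CODE"] in ERR_CODE_NAMES before deciding whether the syndrome is meaningful, since the syndrome applies to ECC errors only.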
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6. IEEE Floating Point Conformance 


EV6 supports the IEEE floating-point operations defined in Version 6 of the Alpha SRM. Support for
a complete implementation of the IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE
Standard 754-1985) is provided by a combination of hardware and software. EV6 provides several
hardware features to facilitate complete support of the IEEE standard. These features are outlined in this
section.


• EV6 implements precise exception handling in hardware.

• EV6 accepts both Signaling and Quiet NaNs as input operands and propagates them as specified by
  the Alpha Architecture. In addition, EV6 delivers a canonical Quiet NaN when an operation is
  required to produce a NaN value and none of its inputs are NaNs. Encodings for Signaling NaN and
  Quiet NaN are defined by the Alpha SRM, version 6.

• EV6 accepts infinity operands and implements infinity arithmetic as defined by the IEEE standard.

• EV6 implements SQRT for single (SQRTS) and double (SQRTT) precision in hardware.

• Denormal input operands produce an unmaskable Denorm Trap when used with arithmetic
  operations. CPYSE/CPYSN, FCMOVxx, and MF_FPCR/MT_FPCR are not arithmetic operations,
  and will pass Denormal values without initiating arithmetic traps.

• EV6 implements the following disable bits in the Floating Point Control Register (FPCR):

  => Underflow Disable (UNFD)
  => Overflow Disable (OVFD)
  => Inexact Result Disable (INED)
  => Division by Zero Disable (DZED)
  => Invalid Operation Disable (INVD)

  If one of these bits is set and an instruction with the /S qualifier set generates the associated
  trapping result, EV6 produces the IEEE nontrapping result and suppresses the trap. These
  nontrapping responses include correctly signed infinity, largest finite number, and Quiet NaNs
  as specified by the IEEE standard. EV6 will not produce a Denorm result for the underflow
  exception. Instead, a true zero (+0) is written to the destination register. In EV6 the FPCR
  Underflow to Zero (UNDZ) bit must be set if the Underflow Disable (UNFD) bit is set. If desired,
  trapping on Underflow can be enabled by the instruction and the FPCR, and software may
  compute the Denorm value as defined in the IEEE Standard.


EV6 records floating-point exception information in two places: 


• The FPCR status bits record the occurrence of all exceptions that are detected, whether or not the
  corresponding trap is enabled. The status bits are cleared only through an explicit clear command
  (MT_FPCR), hence the exception information they record is a summary of all exceptions that have
  occurred since the last time they were cleared.

• If an exception is detected and the corresponding trap is enabled by the instruction, and is not
  disabled by the FPCR control bits, EV6 will record the condition in the EXC_SUM register and
  initiate an arithmetic trap.
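
The trap decision described in this section can be summarized as a small predicate. This is one reading of the text, not a definitive statement of hardware behavior: the function name and argument names are hypothetical, exceptions are identified by the short labels INV, DZE, OVF, UNF and INE, and the unmaskable Denormal-input trap is not modelled. The /S qualifier and the UNFD/UNDZ pairing follow the wording above.

```python
# Maps each exception label to the FPCR bit that can disable its trap.
DISABLE_BIT = {
    "INV": "INVD",
    "DZE": "DZED",
    "OVF": "OVFD",
    "UNF": "UNFD",
    "INE": "INED",
}


def takes_arithmetic_trap(exception, trap_enabled_by_insn, software_completion, fpcr_bits):
    """Return True if the exception results in an arithmetic trap.

    exception            -- one of the keys of DISABLE_BIT
    trap_enabled_by_insn -- the instruction's qualifiers enable this trap
    software_completion  -- the instruction carries the /S qualifier
    fpcr_bits            -- set of FPCR control bit names that are currently set
    """
    if not trap_enabled_by_insn:
        return False
    suppressed = software_completion and DISABLE_BIT[exception] in fpcr_bits
    if exception == "UNF":
        # Underflow suppression additionally requires UNDZ to be set.
        suppressed = suppressed and "UNDZ" in fpcr_bits
    return not suppressed
```

In either outcome the FPCR status bit for the exception is still recorded; the predicate only covers whether EXC_SUM is written and a trap is initiated.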


The following tables list all exceptional input and output conditions recognized by EV6, together with
the result and exception generated for each condition. Notes:


• EV6 will always trap on a Denormal input operand for all arithmetic operations.
• Input operand traps take precedence over arithmetic result traps.
• Abbreviations used in table:
  => Inf: Infinity
  => QNaN: Quiet NaN
  => SNaN: Signaling NaN
  => CQNaN: Canonical Quiet NaN

Alpha AXP Instructions                     EV6 Hardware Supplied Result   Exception

ADDx SUBx INPUT
  Inf operand                              +/-Inf                         (none)
  QNaN operand                             QNaN                           (none)
  SNaN operand                             QNaN                           Invalid Op
  Effective subtract of two Inf operands   CQNaN                          Invalid Op
ADDx SUBx OUTPUT
  Exponent overflow                        +/-Inf or +/-MAX               Overflow
  Exponent underflow                       +0                             Underflow
  Inexact result                           Result                         Inexact
MULx INPUT
  Inf operand                              +/-Inf                         (none)
  QNaN operand                             QNaN                           (none)
  SNaN operand                             QNaN                           Invalid Op
  0 * Inf                                  CQNaN                          Invalid Op
MULx OUTPUT
  (same as ADDx)
DIVx INPUT
  QNaN operand                             QNaN                           (none)
  SNaN operand                             QNaN                           Invalid Op
  0/0 or Inf/Inf                           CQNaN                          Invalid Op
  A/0 (A not 0)                            +/-Inf                         Div Zero
  A/Inf                                    +/-0                           (none)
  Inf/A                                    +/-Inf                         (none)
DIVx OUTPUT
  (same as ADDx)
SQRTx INPUT
  +Inf operand                             +Inf                           (none)
  QNaN operand                             QNaN                           (none)
  SNaN operand                             QNaN                           Invalid Op
  -A (A not 0)                             CQNaN                          Invalid Op
  -0                                       -0                             (none)
SQRTx OUTPUT
  Inexact result                           root                           Inexact
CMPTEQ CMPTUN INPUT
  Inf operand                              True or False                  (none)
  QNaN operand                             False for EQ, True for UN      (none)
  SNaN operand                             False for EQ, True for UN      Invalid Op
CMPTLT CMPTLE INPUT
  Inf operand                              True or False                  (none)
  QNaN operand                             False                          Invalid Op
  SNaN operand                             False                          Invalid Op
CVTfi INPUT
  Inf operand                              0                              Invalid Op
  QNaN operand                             0                              Invalid Op
  SNaN operand                             0                              Invalid Op
CVTfi OUTPUT
  Inexact result                           Result                         Inexact
  Integer overflow                         Truncated result               Invalid Op
CVTif OUTPUT
  Inexact result                           Result                         Inexact
CVTff INPUT
  Inf operand                              +/-Inf                         (none)
  QNaN operand                             QNaN                           (none)
  SNaN operand                             QNaN                           Invalid Op
CVTff OUTPUT
  (same as ADDx)
FBEQ FBNE FBLT FBLE FBGT FBGE
LDS LDT
STS STT
CPYS CPYSN
FCMOVx
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6.1 Floating Point Control Register (FPCR) 


31 , 0 
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IGN/RAZ 





63 62 616059 58 57 56 55 54 5352 51 5049 48 32 


IGN/RAZ 





SUM RW Summary bit. Records bit-wise OR of FPCR exception bits. 

INED RW Inexact Disable. If this bit is set and a floating point instruction which 
enables trapping on inexact results generates an inexact value, the result 
is placed in the destination register and the trap is suppressed. 

UNFD RW Underflow Disable. If UNFD and UNDZ are set and a floating point
instruction which enables trapping on underflow and which has the
software completion qualifier set generates an underflow, then the trap is
suppressed.

UNDZ RW Underflow to zero. The Alpha architecture specifies that if this bit is set
along with UNFD, then on underflow implementations place an
appropriately signed zero value in the destination register rather than the
denormal number specified by the IEEE standard, if they are capable of
doing so.

EV6 is not capable of generating IEEE compliant denormal results and
always generates a positive zero (+0.0) on underflow. Hence this bit is
only used along with UNFD to determine whether to suppress underflow
traps.

DYN RW Dynamic rounding mode. Indicates the rounding mode to be used by an
IEEE floating point instruction when the instruction specifies dynamic
rounding mode:
00 Chopped
01 Minus infinity
10 Normal
11 Plus infinity

IOV RW Integer overflow. An integer arithmetic operation or a conversion from
floating to integer overflowed the destination precision.

INE RW Inexact result. A floating arithmetic or conversion operation gave a result
that differed from the mathematically exact result.

UNF RW Underflow. A floating arithmetic or conversion operation gave a result
that underflowed the destination exponent.

OVF RW Overflow. A floating arithmetic or conversion operation gave a result that
overflowed the destination exponent.

DZE RW Divide by zero. An attempt was made to perform a floating divide with a
divisor of zero.

INV RW Invalid operation. An attempt was made to perform a floating arithmetic
operation and one or more of its operand values were illegal.

OVFD RW Overflow disable. If this bit is set and a floating arithmetic operation
generates an overflow condition, then the appropriate IEEE non-trapping
result is placed in the destination register and the trap is suppressed.

DZED RW Division by zero disable. If this bit is set and a floating divide by zero is
detected, the appropriate IEEE non-trapping result is placed in the
destination register and the trap is suppressed.

INVD RW Invalid operation disable. If this bit is set and a floating operate generates
an invalid operation condition and EV6 is capable of producing the
correct IEEE nontrapping result, that result is placed in the destination
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7. Error Detection and Handling 


This section gives an overview of EV6’s error detection and error handling mechanisms. 


• The system port data bus is quadword ECC protected.
• The Bcache tag is parity protected.
• The Bcache data bus is quadword ECC protected.
• The Dcache tag array is parity protected.
• The Dcache data array is quadword ECC protected; however, this mode of operation is only supported
  in systems in which ECC is enabled on both the system and Bcache ports.
• The Icache tag array is parity protected.
• The Icache data array is parity protected.
• The Dcache duplicate tag array is ECC protected.


The EV6 ECC implementation detects and corrects single bit errors in hardware. Multiple bit errors 
within a quadword are not detected. 


7.1 Icache Data or Tag Parity Error


• The hardware detects the error, replay-traps the instructions which were fetched under the error, and
  flushes the entire Icache so the re-fetched instructions are not sourced directly from the Icache.
• I_STAT<TPE> or <DPE> is set.
• A CRD interrupt is posted.


7.2 Dcache Tag Parity Error 


The primary copies of the Dcache tags are only used when servicing CPU-generated loads and stores,
hence a Dcache tag parity error is processed as a fault.
• Machine check occurs before any machine state is changed.
• EXC_ADDR contains the PC of the load or store instruction which triggered the error.
• The TPERR_P0 and TPERR_P1 fields of the DC_STAT register are written to indicate the source of
  the error.
• The virtual address associated with the error is available in the VA register.
• Recovery: flush the errored block using the ECB (Evict Data Cache Block) instruction. The on-chip
  duplicate tag provides the correct victim address and cache state.


7.3 Dcache Data Correctable ECC Error


The actions which may invoke Dcache data ECC errors are:


• Load instructions
• Stores of less than quadword length
• Dcache Victim Reads


The hardware flow used for Dcache data ECC errors depends on the action which triggered the error.


7.3.1 Load Instruction 


Load instructions only trigger Dcache ECC errors if they use the data, i.e. if they hit in the Dcache. Loads 
which read their data from the Dcache may do so either in the same cycle as the Dcache tag probe (typical 
case) or in some subsequent cycle (load-queue retry). The hardware flows for these two error cases differ 
slightly. 
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If an ECC error occurs when a load reads the Dcache data array in the same cycle as the tag array, then 
the Ibox stops retiring instructions before the offending load retires, and does not start retiring again until 
after hardware recovers from the error. 


If an ECC error occurs when a load reads the Dcache data array after it read the Dcache tag array, then 
the load may already have retired. 


In either case: 


• The load’s destination register is written with incorrect data; however, the load queue will retain the
  state associated with the load instruction.

• A consumer of the load’s data may issue before the error is recognized; however, the Ibox will invoke
  a replay trap at an instruction which is older than (or equal to) any instruction which consumes the
  load’s data, and then stall the replayed I-stream in the map stage of the pipeline until the error is
  corrected.

• The Cbox scrubs the block in the Dcache, which it does by evicting the block into the victim buffer
  (thereby scrubbing it) and writing it back into the Dcache.

• The load queue retries the load and rewrites the register.

• A corrected read (CRD) interrupt is posted.

• DC_STAT register:
  => DECC_ERR set
  => DECC_COR set

7.3.2 Store Instruction (Less than Quadword Length) 


A store of less than quadword length could invoke a Dcache ECC error since the original quadword must 
be read to calculate the new check bits. 


• The Mbox scrubs the original quadword and replays the write.
• The Mbox posts a CRD interrupt.
• DC_STAT register:
  => DECC_ERR set
  => DECC_COR set


7.3.3 Victim Reads 
• ECC-errored Dcache victims are scrubbed as they are written into the victim data buffer.
• A CRD interrupt is posted.
• DC_STAT register:
  => DECC_ERR set
  => DECC_COR set


7.4 Dcache Triplicate Tag Parity Error

• Machine check.
• C_STAT: TPERR is set.
• C_ADDR: contains bits <43:6> of the address associated with the error.


7.5 Bcache Tag Parity Error 

• Machine check.
• C_STAT: TPERR is set.
• C_ADDR: contains bits <43:6> of the address associated with the error.
• BC_TAG: contains the tag and tag control fields of the errored Bcache block.
• Bcache Tag Parity Errors are not recoverable.


7.6 Bcache Data Correctable ECC Error
The actions which may trigger Bcache data ECC errors are:


• Icache fill
• Dcache fill, data possibly used by load instruction.
• Victim read invoked by a system port probe or by the processor’s own reference stream.


Independent of the action which triggered the error:


• A CRD interrupt is posted.
• C_STAT: BC_ECC is set. ECC_CRD is set.
• C_ADDR: contains bits <43:6> of the address associated with the error.


The recovery mechanism depends on the action which triggered the error. 


7.6.1 Icache Fill from Bcache 

For an Icache fill, bad Icache data parity is generated for the octaword which contains the errored 
quadword. 

• The hardware flushes the Icache. 

• C_STAT: BC_ECC is set. 

• A machine check is invoked. The PAL machine check handler must scrub the block in the Bcache. 


7.6.2 Dcache Fill from Bcache 


If the errored quadword is not used to satisfy a load instruction, no hardware recovery flow is invoked - the 
errored quadword and its associated check bits are written into the Dcache. 


If the errored quadword is used to satisfy a load instruction, then the flow is very similar to that used for a 
Dcache ECC error: 


• The load’s destination register is written with incorrect data; however, the load queue will retain the 
state associated with the load instruction. 

• A consumer of the load’s data may issue before the error is recognized; however, the Ibox will invoke 
a replay trap at an instruction which is older than (or equal to) any instruction which consumes the 
load’s data, and then stall the replayed I-stream in the map stage of the pipeline until the error is 
corrected. 

• The Cbox scrubs the block in the Dcache, which it does by evicting the block into the victim buffer 
(thereby scrubbing it) and writing it back into the Dcache. 

• The load queue retries the load and rewrites the register. 


7.6.3 Victim Read 
The errored quadword is written to the system port. It is not scrubbed. 


7.7 Bcache Data Uncorrectable ECC Error 


• Machine Check 
• C_STAT: BC_ECC is set. ECC_CRD is clear. 
• C_ADDR: contains bits <43:6> of the address associated with the error. 


7.8 Memory Data Correctable ECC Error 
The actions which may trigger memory data ECC errors are: 


• Icache fill 
• Dcache fill, data possibly used by load instruction. 


Independent of the action which triggered the error: 


• A CRD interrupt is posted. 
• C_STAT: MEM_ECC is set. ECC_CRD is set. 
• C_ADDR: contains bits <43:6> of the address associated with the error. 


The recovery mechanism depends on the action which triggered the error. 


7.8.1 Icache Fill from Memory 

For an Icache fill, bad Icache data parity is generated for the octaword which contains the errored 
quadword. 

• The hardware flushes the Icache. 

• C_STAT: MEM_ECC is set. 

• A machine check is invoked. The PAL machine check handler must scrub the block in the Bcache 
and memory. 


7.8.2 Dcache Fill from Memory 


If the errored quadword is not used to satisfy a load instruction, no hardware recovery flow is invoked - the 
errored quadword and its associated check bits are written into the Dcache and later written into the 
Bcache. 


If the errored quadword is used to satisfy a load instruction, then the flow is very similar to that used for a 
Dcache ECC error: 


• The load’s destination register is written with incorrect data; however, the load queue will retain the 
state associated with the load instruction. 

• A consumer of the load’s data may issue before the error is recognized; however, the Ibox will invoke 
a replay trap at an instruction which is older than (or equal to) any instruction which consumes the 
load’s data, and then stall the replayed I-stream in the map stage of the pipeline until the error is 
corrected. 

• The Cbox scrubs the block in the Dcache, which it does by evicting the block into the victim buffer 
(thereby scrubbing it) and writing it back into the Dcache. 

• The load queue retries the load and rewrites the register. 


7.9 Memory Data Uncorrectable ECC Error 


• Machine Check 
• C_STAT: MEM_ECC is set. ECC_CRD is clear. 
• C_ADDR: contains bits <43:6> of the address associated with the error. 


7.10 System Port Read Errors 


• Machine Check 
• C_STAT: SRDERR set. 
• C_ADDR: contains bits <43:6> of the address associated with the error. 


8. Initialization and Test 
**To be specified. Here are a few key points: 


The SROM port will work just like EV4 and EV5. 

The SROM pins can do double-duty as a software-controlled UART, just like EV4 and EV5. 

Unlike EV4 and EV5, systems are required to have an SROM - that will be the only way to configure 
the system port. 

There will be an IEEE 1149.1 compliant test access port. 

There will be Built-in-Self-Test (BIST) of all major storage arrays, and Built-in-Self-Repair (BISR) of 
the Icache and Dcache tag and data arrays. 


9. Electrical Data 


This chapter describes the electrical characteristics of EV6 and its interface pins. It will contain electrical 
characteristics, DC characteristics, AC characteristics and power supply considerations. 


9.1 Electrical Characteristics 


The following table lists the maximum ratings for EV6. 


Characteristics              Ratings 

Storage Temperature          -55C to +125C 
Junction Temperature         27C to 100C 
Supply Voltage               VSS 0V, VDD 2.0V 
Input or Output applied      -0.5 to TBD V 
Maximum Power @ VDD=?,       TBD W typical 
Frequency=TBD MHz            TBD W maximum 


9.2 DC Characteristics 


9.2.1 Power Supply 
The VSS pins are connected to 0.0V, and the VDD pins are connected to 2.0V, +/- 100mV. 


9.2.2 Input Signal Pins 
Nearly all input signals are CMOS inputs with 2.0V levels. The one exception is CLK_IN_H/L. 


9.2.3 Driven Signals From EV6 


EV6 requires a floating well type driver on the Bcache I/O interface, due to Bcache configurations that 
may drive voltages in excess of a threshold voltage above the 2V VDD. All I/O cells will use the same 
floating well design; however, the drive strengths will not be the same. 


The output only cells will not use a floating well design, but will use a simple push-pull circuit. More 
than one drive strength may be required for this pin category. 


The SROM pins must be truly TTL compatible. This is achieved by employing open drain pulldown 
circuits. A resistor must be placed on the module to the 3.3V supply to pull the signal past the TTL Vih 
point. 


Some lines will be either series or parallel terminated. For the parallel terminated lines, the chip will 
provide good Voh and Vol margins while sourcing or sinking the termination current, as defined in the 
table below. 


(TBD) 


The following table shows the drive specifications for each category of driver (typical-typical process, Tj=85 
degrees C, VDD=2.0V). 


EV6 IO SPECIFICATIONS 

Parameter     Value             Current    Notes 
Voh_ca        VDD - 0.25V       45 mA      1 
Vol_ca        0.25V             25 mA      1 
Voh_cd        VDD - 0.25V       2 mA       2 
Vol_cd        0.25V             2 mA       2 
Vih           Vref +/- 250mV    TBD mA     3 
Vil           Vref +/- 250mV    TBD mA     3 
Vol_OD        0.25V             75 mA      4 
Vih_HV        VDD/2 + 250mV     TBD mA     5 
Vil_HV        VDD/2 - 250mV     TBD mA     5 
Vol_OD 3V     0.4V              4 mA       6 

1. Applies to cache address drivers. 
2. Applies to cache data drivers. 
3. Applies to all 2.5V tolerant inputs. 
4. Applies to ‘OPEN DRAIN’ type outputs (assumes a TBD ohm resistor connected to the 3.3V supply 
on the module). 
5. Applies to 3V tolerant pins. 
6. Applies to 3V open drain outputs (assumes resistor to 3.3V supply). 


9.3 AC Characteristics 


This section describes the AC timing specifications for EV6. 


9.3.1 Clocking Scheme 


The System port clocking scheme is described in detail in section 3.3.12. It requires a differential input 
clock and a system or framing clock. There is one signal that is considered synchronous while all other 
signals employ a clock forwarding scheme that is described in section 3.3.12. The system or FrameClk_H 
is used to establish a starting point for multi-cycle transfers of command and data to the system. It is also 
used to perform a synchronous clock forward reset of the interface. 


9.3.2 Input Clocks 


The differential input clock signals CLK_IN_H/L may have a frequency ranging from 80 to 200 MHz. Systems 
choose the frequency within that range which matches the requirements of their own clock 
distribution. This input clock is compared against a divided-down copy of the VCO output for 
phase alignment. The GCLK is not the product of this oscillator input so there is no requirement for 
specialized circuitry to detect the presence of CLK_IN_H/L. 


One additional input clock is a single-ended square wave clock called the framing clock. This is expected 
to be a skew-controlled copy of the exact clock distributed throughout the system. The period of this 
clock can be identical to that of osc_clk_in_h/l or an integer multiple of it, and it should be phase 
aligned with osc_clk_h/l with distribution skew not in excess of TBD psec. 


The framing clock has two functions. First, it is used to provide a known starting point for all clock forwarding 
transfers that emanate from EV6. Second, it is the clock used for the only synchronous signal in the interface, the 
clock forward reset. 


9.3.2.1 Clock Termination and Impedance Levels 


9.3.2.2 AC Coupling 


9.3.3 


9.3.4 Analog PLL 
**** PUT PLL SPEC HERE **** 


9.3.5 Timing 


9.3.5.1 Synchronous Signals 


There is one synchronous output signal driven from EV6: ClkFwdReset_H. All clock forward 
circuits are reset using this signal. It is operationally specified in section 3.3.11.11. This signal is 
clocked by a copy of the framing clock described above and is not derived from GCLK off the internal 
PLL. Therefore, it does not have a skew component caused by drift in the PLL. The timing 
specification is the delay from the input of the framing clock to the output pad driver, which is TBD min 
and TBD max. 


9.3.5.2 Asynchronous Signals 


The following is a list of asynchronous input signals: 
IRQ<5:0> SROMDATA DC_OK_H RESET 


9.3.5.3 Clock Forwarded Signals For System Interface 


Clock forwarding is described in detail in section 3.3.11.4. The following is a list of input only signals 
that are accompanied by a clock and are open drain: 


SysAddIn<14:0> SysFillValid SysDataInValid SysDataOutValid 


             450 MHz     500 MHz     600 MHz 
min Setup    -200 psec   -300 psec   -400 psec 


9.3.5.3.1 










The following is a list of output signals that are accompanied by a clock and are open drain: 
SysAddOut<14:0> SysAddOutClkH 

        450 MHz     500 MHz     600 MHz 
skew 


The following is a list of bi-directional signals that are accompanied by a clock. When EV6 is driving the 
data bus there is one clock for 18 bits (i.e. SysDataOutClk<3> is associated with SysData<63:48> and 
SysCheck<7:6>); when EV6 is receiving there is one clock for every 9 bits. The drivers are open drain: 


SysData<63:56>   SysCheck<7>   SysDataInClk<7>   SysDataOutClk<3> 
SysData<55:48>   SysCheck<6>   SysDataInClk<6>   SysDataOutClk<3> 
SysData<47:40>   SysCheck<5>   SysDataInClk<5>   SysDataOutClk<2> 
SysData<39:32>   SysCheck<4>   SysDataInClk<4>   SysDataOutClk<2> 
SysData<31:24>   SysCheck<3>   SysDataInClk<3>   SysDataOutClk<1> 
SysData<23:16>   SysCheck<2>   SysDataInClk<2>   SysDataOutClk<1> 
SysData<15:8>    SysCheck<1>   SysDataInClk<1>   SysDataOutClk<0> 
SysData<7:0>     SysCheck<0>   SysDataInClk<0>   SysDataOutClk<0> 
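
The grouping in the table above follows a simple rule: one input clock per data byte (plus its check bit) and one output clock per pair of bytes. The following Python sketch expresses that rule. The function names are illustrative only and do not appear in the specification.

```python
from typing import Tuple


def sysdata_clocks(bit: int) -> Tuple[int, int]:
    """Return (SysDataInClk index, SysDataOutClk index) for a SysData bit."""
    if not 0 <= bit <= 63:
        raise ValueError("SysData bit must be in 0..63")
    # One input clock per 8 data bits, one output clock per 16 data bits.
    return bit // 8, bit // 16


def syscheck_clocks(check_bit: int) -> Tuple[int, int]:
    """Return (SysDataInClk index, SysDataOutClk index) for a SysCheck bit."""
    if not 0 <= check_bit <= 7:
        raise ValueError("SysCheck bit must be in 0..7")
    # Each check bit travels with its data byte; two check bits share an output clock.
    return check_bit, check_bit // 2
```

A board-level netlist check could use these helpers to confirm that every data and check bit is length-matched against the clock it is actually captured with.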













Incoming direction with respect to SysDataInClk 

             450 MHz     500 MHz     600 MHz 
min Setup    -200 psec   -300 psec   -400 psec 
1005 psec 









Timing difference across outputs including SysDataOutClk 


                   450 MHz     500 MHz     600 MHz 
max output skew    560 psec    560 psec    560 psec 





9.3.5.4 Bcache Timing 


The Bcache is entirely private to the EV6 pinbus. Address and control are directly driven from EV6 
along with multiple differential clocks. There is an internal adjustment which delays the clock relative to the 
address and control. All signals to the synchronous SRAM devices are directly driven to the device. Data 
is bidirectional. For writing, data is clocked into the SRAM by the same clock used for address and control. 
For reading, there are two styles of data delivery to EV6. In the first, the conventional REG/REG component 
drives data on the rising edge of its received clock and EV6 uses a copy of this same clock to capture this 
data at the pads. In the second, data is clock forwarded from the SRAM device, with data 
supplied on the rising and falling edges of the clock. 


The following signals are driven from EV6 to the devices and must meet the setup and hold 
constraints of the receiving device: 


BcAddress<23:4> BcDataOE_L BcLoad_L BcDataWr_L 
BcTagOE_L BcTagWr_L 


Clocks: BcDataOutClk<3:0>H/L 


The following signals are Bidirectional. When EV6 is driving, these signals must meet the setup and hold 
constraints of the receiving device: 


BcData<127:0> BcCheck<7:0> 


Clocks: BcDataInClk<7:0>H/L 













Incoming direction with respect to BcDataInClk 
450 MHz      500 MHz      600 MHz 
-200 psec    -300 psec    400 psec 


BcTag<42:20> BcTagValid_H BcTagDirty_H BcTagShared_H BcTagParity_H 
Clocks: BcTagInClkH/L 






450 MHz      500 MHz      600 MHz 
-200 psec    -300 psec    400 psec 










9.4 Power Supply Considerations 


9.4.1 Decoupling 


9.4.2 Power Supply Sequencing 


10. Packaging Information 


10.1 Introduction 


This chapter provides detailed information on the chip package and the complete pinout for the 587 pin 
ceramic PGA for EV6. 


10.2 Package Information 


The following figure shows the pin location and the package dimensions. 


10.3 EV6 Pinout 


[Figure: EV6 pin grid map. Rows are lettered A through BE and columns are numbered 1 through 45; 
pin positions are staggered, with alternate rows using odd and even column numbers, and the 
center of the array is unpopulated.] 


Signal Pin 


Signal Name          Pin Number   Type 
BCADDRESS_4          D26          OUT 
BCADDRESS_5          E27          OUT 
BCADDRESS_6          C27          OUT 
BCADDRESS_7          B28          OUT 
BCADDRESS_8          D28          OUT 
BCADDRESS_9          H28          OUT 
BCADDRESS_10         G29          OUT 
BCADDRESS_11         C29          OUT 
BCADDRESS_12         A29          OUT 
BCADDRESS_13         B30          OUT 
BCADDRESS_14         A31          OUT 
BCADDRESS_15         F30          OUT 
BCADDRESS_16         H30          OUT 
BCADDRESS_17         E31          OUT 
BCADDRESS_18         G31          OUT 
BCADDRESS_19         D32          OUT 
BCADDRESS_20         F32          OUT 
BCADDRESS_21         E33          OUT 
BCADDRESS_22         C33          OUT 
BCADDRESS_23         B34          OUT 
BCCHECK_0            BC7          BI 
BCCHECK_1            AV12         BI 
BCCHECK_2            BC11         BI 
BCCHECK_3            AY14         BI 
BCCHECK_4            AY38         BI 
BCCHECK_5            BE41         BI 
BCCHECK_6            BB38         BI 
BCCHECK_7            AW35         BI 
BCCHECK_8            BB8          BI 
BCCHECK_9            BE9          BI 
BCCHECK_10           BB12         BI 
BCCHECK_11           AW15         BI 
BCCHECK_12           AW37         BI 
BCCHECK_13           BD40         BI 
BCCHECK_14           BA37         BI 
BCCHECK_15           AV34         BI 
BCDATAINCLK_0_H      F8           IN 
BCDATAINCLK_0_L      E7           IN 
BCDATAINCLK_1_H      P4           IN 
BCDATAINCLK_1_L      R5           IN 
BCDATAINCLK_2_H      AH4          IN 
BCDATAINCLK_2_L      AJ3          IN 
BCDATAINCLK_3_H      AY8          IN 
BCDATAINCLK_3_L      AW9          IN 
BCDATAINCLK_4_H      E39          IN 
BCDATAINCLK_4_L      F38          IN 
BCDATAINCLK_5_H      R41          IN 
BCDATAINCLK_5_L      P42          IN 
BCDATAINCLK_6_H      AF40         IN 
BCDATAINCLK_6_L      AG41         IN 
BCDATAINCLK_7_H      AV40         IN 
BCDATAINCLK_7_L 
BCDATAOE_L 
BCDATAOUTCLK_0_H 
BCDATAOUTCLK_0_L 
BCDATAOUTCLK_1_H 
BCDATAOUTCLK_1_L 
BCDATAOUTCLK_2_H 
BCDATAOUTCLK_2_L 
BCDATAOUTCLK_3_H 
BCDATAOUTCLK_3_L 
BCDATAWR_L 
BCDATA_0 
BCDATA_1 
BCDATA_2 
BCDATA_3 
BCDATA_4 
BCDATA_5 
BCDATA_6 
BCDATA_7 
BCDATA_8 
BCDATA_9 
BCDATA_10 
BCDATA_11 
BCDATA_12 
BCDATA_13 
BCDATA_14 
BCDATA_15 
BCDATA_16 
BCDATA_17 
BCDATA_18 
BCDATA_19 
BCDATA_20 
BCDATA_21 
BCDATA_22 
BCDATA_23 
BCDATA_24 
BCDATA_25 
BCDATA_26 
BCDATA_27 
BCDATA_28 
BCDATA_29 
BCDATA_30 
BCDATA_31 
BCDATA_32 
BCDATA_33 
BCDATA_34 
BCDATA_35 
BCDATA_36 
BCDATA_37 
BCDATA_38 
BCDATA_39 
BCDATA_40 
BCDATA_41 
BCDATA_42 
BCDATA_43 
BCDATA_44 
BCDATA_45 
BCDATA_46 
BCDATA_47            V44          BI 
BCDATA_48            AB38         BI 
BCDATA_49            AB40         BI 
BCDATA_50            AC43         BI 
BCDATA_51            AD44         BI 
BCDATA_52            AE45         BI 
BCDATA_53            AH44         BI 
BCDATA_54            AK44         BI 
BCDATA_55            AK38         BI 
BCDATA_56            AL39         BI 
BCDATA_57            AN43         BI 
BCDATA_58            AR45         BI 
BCDATA_59            AP38         BI 
BCDATA_60            AW43         BI 
BCDATA_61            AT38         BI 
BCDATA_62            BA43         BI 
BCDATA_63            BC41         BI 
BCDATA_64            D12          BI 
BCDATA_65            A9           BI 
BCDATA_66            F10          BI 
BCDATA_67            C7           BI 
BCDATA_68            D6           BI 
BCDATA_69            G5           BI 
BCDATA_70            D2           BI 
BCDATA_71            L7           BI 
BCDATA_72            F2           BI 
BCDATA_73            K2           BI 
BCDATA_74            R7           BI 
BCDATA_75            M2           BI 
BCDATA_76            U7           BI 
BCDATA_77            T2           BI 
BCDATA_78            V2           BI 
BCDATA_79            Y4           BI 
BCDATA_80            AA1          BI 
BCDATA_81            AB8          BI 
BCDATA_82            AD2          BI 
BCDATA_83            AE1          BI 
BCDATA_84            AF4          BI 
BCDATA_85            AH2          BI 
BCDATA_86            AK2          BI 
BCDATA_87            AJ7          BI 
BCDATA_88            AP2          BI 
BCDATA_89            AL7          BI 
BCDATA_90            AT2          BI 
BCDATA_91            AY2          BI 
BCDATA_92            BA3          BI 
BCDATA_93            BE3          BI 
BCDATA_94            BB6          BI 
BCDATA_95            BE5          BI 
BCDATA_96            A35          BI 
BCDATA_97            A37          BI 
BCDATA_98            F36          BI 
BCDATA_99            C39          BI 
BCDATA_100           D40          BI 
BCDATA_101           D44          BI 
BCDATA_102           J39          BI 
BCDATA_103           F44          BI 
BCDATA_104           L39          BI 
BCDATA_105           L43          BI 


BCDATA_106           L45          BI 
BCDATA_107           R39          BI 
BCDATA_108           U39          BI 
BCDATA_109           R45          BI 
BCDATA_110           Y40          BI 
BCDATA_111           W43          BI 
BCDATA_112           AA41         BI 
BCDATA_113           AB44         BI 
BCDATA_114           AD38         BI 
BCDATA_115           AE39         BI 
BCDATA_116           AF42         BI 
BCDATA_117           AH38         BI 
BCDATA_118           AL45         BI 
BCDATA_119           AK40         BI 
BCDATA_120           AM42         BI 
BCDATA_121           AN41         BI 
BCDATA_122           AT44         BI 
BCDATA_123           AR39         BI 
BCDATA_124           AY44         BI 
BCDATA_125           AU39         BI 
BCDATA_126           BB44         BI 
BCDATA_127           BD42         BI 
BCLOAD_L             F26          OUT 
BCTAGDATA_20         F14          BI 
BCTAGDATA_21         G15          BI 
BCTAGDATA_22         H16          BI 
BCTAGDATA_23         A11          BI 
BCTAGDATA_24         B12          BI 
BCTAGDATA_25         C13          BI 
BCTAGDATA_26         D14          BI 
BCTAGDATA_27         E15          BI 
BCTAGDATA_28         F16          BI 
BCTAGDATA_29         G17          BI 
BCTAGDATA_30         H18          BI 
BCTAGDATA_31         A15          BI 
BCTAGDATA_32         B16          BI 
BCTAGDATA_33         C17          BI 
BCTAGDATA_34         D18          BI 
BCTAGDATA_35         E19          BI 
BCTAGDATA_36         A17          BI 
BCTAGDATA_37         F20          BI 
BCTAGDATA_38         D20          BI 
BCTAGDATA_39         G21          BI 
BCTAGDATA_40         E21          BI 
BCTAGDATA_41         A21          BI 
BCTAGDATA_42         H22          BI 
BCTAGDIRTY_H         G23          BI 
BCTAGINCLK_H         C19          IN 
BCTAGINCLK_L         B18          IN 
BCTAGOE_L            H24          OUT 
BCTAGOUTCLK_H        C23          OUT 
BCTAGOUTCLK_L        B24          OUT 
BCTAGPARITY_H        F22          BI 
BCTAGSHARED_H        B22          BI 
BCTAGVALID_H         F24          BI 
BCTAGWR_L            A25          OUT 
CLKFWDRESET_H        BC19         OUT 
CLKIN_H 
CLKIN_L 
DCOK_H 
EV6CLK_H 
EV6CLK_L 
FRAMECLK_H 
IRQ_H_0 
IRQ_H_1 
IRQ_H_2 
IRQ_H_3 
IRQ_H_4 
IRQ_H_5 
PLLBYPASS_H 
PLLVDD 
RESET_L 
SROMCLK_H 
SROMDATA_H 
SROMEN_L 
SYSADDINCLK_H 
SYSADDINCLK_L 
SYSADDIN_0_L 
SYSADDIN_1_L 
SYSADDIN_2_L 
SYSADDIN_3_L 
SYSADDIN_4_L 
SYSADDIN_5_L 
SYSADDIN_6_L 
SYSADDIN_7_L 
SYSADDIN_8_L 
SYSADDIN_9_L 
SYSADDIN_10_L 
SYSADDIN_11_L 
SYSADDIN_12_L 
SYSADDIN_13_L 
SYSADDIN_14_L 
SYSADDOUTCLK_H 
SYSADDOUTCLK_L 
SYSADDOUT_0_L 
SYSADDOUT_1_L 
SYSADDOUT_2_L 
SYSADDOUT_3_L 
SYSADDOUT_4_L 
SYSADDOUT_5_L 
SYSADDOUT_6_L 
SYSADDOUT_7_L 
SYSADDOUT_8_L 
SYSADDOUT_9_L 
SYSADDOUT_10_L 
SYSADDOUT_11_L 
SYSADDOUT_12_L 
SYSADDOUT_13_L 
SYSADDOUT_14_L 
SYSCHECK_0_L 
SYSCHECK_1_L 
SYSCHECK_2_L 
SYSCHECK_3_L 
SYSCHECK_4_L 
SYSCHECK_5_L 

Pin Number 

BB20 
AM8 
AP8 

BA19 

AY16 

BA15 

AV16 

BB14 

BC13 

BD12 
AT6 
AT8 

AY20 

AV18 

AW17 

BE15 

BE25 

AW25 

BD28 

BA27 

AY26 

BC27 

BB26 

BA25 

AV24 

AY24 

BD24 

AW23 

BC23 

AY22 

BD22 

BE21 

BA21 

BB32 

Pin Number 

BA31 

BD36 

BC35 

BB34 

BA33 

AY32 

BE35 

BD34 

BC33 

AW31 

AV30 

AY30 

AW29 

AV28 

BE31 

BD30 

AW11 

BD10 
BA13 
BE11 

AV36 
BC39 

Signal Name 
SYSCHECK_6_L 
SYSCHECK_7_L 
SYSDATAINCLK_0_H 
SYSDATAINCLK_0_L 
SYSDATAINCLK_1_H 
SYSDATAINCLK_1_L 
SYSDATAINCLK_2_H 
SYSDATAINCLK_2_L 
SYSDATAINCLK_3_H 
SYSDATAINCLK_3_L 
SYSDATAINCLK_4_H 
SYSDATAINCLK_4_L 
SYSDATAINCLK_5_H 
SYSDATAINCLK_5_L 
SYSDATAINCLK_6_H 
SYSDATAINCLK_6_L 
SYSDATAINCLK_7_H 
SYSDATAINCLK_7_L 
SYSDATAINVALID_L 
SYSDATAOUTCLK_0_H 
SYSDATAOUTCLK_0_L 
SYSDATAOUTCLK_1_H 
SYSDATAOUTCLK_1_L 
SYSDATAOUTCLK_2_H 
SYSDATAOUTCLK_2_L 
SYSDATAOUTCLK_3_H 
SYSDATAOUTCLK_3_L 
SYSDATAOUTVALID_L 
SYSDATA_0_L 
SYSDATA_1_L 
SYSDATA_2_L 
SYSDATA_3_L 
SYSDATA_4_L 
SYSDATA_5_L 
SYSDATA_6_L 
SYSDATA_7_L 
SYSDATA_8_L 
SYSDATA_9_L 
SYSDATA_10_L 
SYSDATA_11_L 
SYSDATA_12_L 
SYSDATA_13_L 
SYSDATA_14_L 
SYSDATA_15_L 
SYSDATA_16_L 
SYSDATA_17_L 
SYSDATA_18_L 
SYSDATA_19_L 
SYSDATA_20_L 
SYSDATA_21_L 
SYSDATA_22_L 
SYSDATA_23_L 
SYSDATA_24_L 
SYSDATA_25_L 
SYSDATA_26_L 
SYSDATA_27_L 
SYSDATA_28_L 
SYSDATA_29_L 
SYSDATA_30_L 
SYSDATA_31_L 
SYSDATA_32_L 
SYSDATA_33_L 
SYSDATA_34_L 
SYSDATA_35_L 
SYSDATA_36_L 
SYSDATA_37_L 
SYSDATA_38_L 
SYSDATA_39_L 
SYSDATA_40_L 
SYSDATA_41_L 
SYSDATA_42_L 
SYSDATA_43_L 
SYSDATA_44_L 
SYSDATA_45_L 
SYSDATA_46_L 
SYSDATA_47_L 
SYSDATA_48_L 
SYSDATA_49_L 
SYSDATA_50_L 
SYSDATA_51_L 
SYSDATA_52_L 
SYSDATA_53_L 
SYSDATA_54_L 
SYSDATA_55_L 
SYSDATA_56_L 
SYSDATA_57_L 
SYSDATA_58_L 
SYSDATA_59_L 
SYSDATA_60_L 
SYSDATA_61_L 
SYSDATA_62_L 
SYSDATA_63_L 
SYSFILLVALID_L 
TESTCLK_H 
TESTDATAIN_H 
TESTDATAOUT_H 
TESTMODESELECT_H 
TESTRESET_L 
VREFBCACHE 
VREFSYS 


Ground Pins 


BD44 D36 
BB42 F34 
AY40 H32 
AV38 B32 
AV44 D30 
AT42 F28 
AP40 H26 
AM38 B26 
AM44 D24 
AK42 A23 
AH40 D22 
AF38 B20 


BAT 
BD6 
C35 
H34 
E37 
B40 
G37 


K38 
G43 
M38 
K44 
M44 
T38 
V38 
U45 
AA39 
Y42 
AA45 
AC39 
AD40 
AE41 
AG43 
AJ45 
AJ39 
AL41 
Pin Number 

AM40 
AP 44 
AU45 
AT40 
BA45 
AY42 
BC45 
BE43 
BC29 
BC17 
BB18 
BD18 
BD16 
BE17 
AW21 
AV22 


AV26 
AY28 
BB30 
BD32 
AV32 
AY34 
BB36 
BD38 
AV38 
BD26 

AF44 H20 AU7 


AD42 F18 AT4 
AC45 D16 AV2 
AB42 B14 AY6 
Y38 H14 BB4 
Y44 F12 BD2 
V40 D10 AV8 
T42 B8 AV14 
P44 G9 AY12 
P38 B4 BB10 
M40 B2 BD8 
K42 D4 AV20 
H44 F6 AY18 
H38 H2 BB16 
F40 K4 BD14 
D42 M6 BD20 
B44 P8 BB22 
B42 P2 BB24 
B38 V6 BE23 
VDD_2V 

BC43 A19 AW13 
BA41 G19 BE13 
AW45 E17 BC15 
AU43 Crs BA17 
AR41 A13 AW19 
AN39 G13 BE19 
AN45 Bid BC21 
AL43 CY BA23 
AJ41 A7 BC25 
AG39 A5 BE27 
AG45 AW27 
AE43 A3 BA29 
AC41 C3 BC31 
AA43 E5 BE33 
W45 G7 AW33 
W39 ion +h es BA35 egos Z te 
U41 J3 BC37 
R43 L5 BE39 
N45 N7 AW39 
N39 N1 
L41 R3 
J43 U5 
G45 W7 
G39 W1 
E41 AA3 
C43 AC5 
A43 AE3 

AG1 
A41 AG7 
H36 AJ5 
A39 AL3 
C37 AN1 
E35 AN7 
G33 AR5 
A33 AU3 
C31 AW1 
E29 BC3 
G27 BA5 
A27 AW7 
C25 BE7 
E23 . BC9 
C21 BA11 


11. Appendix 1: Reset and Sleep Mode 


This chapter contains reset and sleep mode information. It has not been written yet. 


12. Appendix 2: PAL Coding Restrictions 


12.1 Restriction: Reset Sequence Required by Retirator and Mapper 


(a) For convenience of implementation, the retirator "done" status bits are not initialized during reset. 
Instead, the hardware relies on the first batch of valid instructions to sweep through inum-space and initialize 
these bits. The 80 status bits, corresponding to the maximum number of inflight instructions, must be 
marked "not done" by the first 80 instructions mapped after reset and subsequently marked "done" 
when those instructions retire. Therefore, the first 20 fetch blocks must each contain 4 valid instructions 
and no retirator-nops (see previous guideline). 


Example: 

reset: 

ADDQ R31,#19,R0 
ADDQ R31,R0,R0 
ADDQ R31,R0,R0 
ADDQ R31,R0,R0 
loop: 

SUBQ R0,#1,R0 
ADDQ R31,R0,R0 
ADDQ R31,R0,R0 
BNE R0, loop 
continue: 


Note that all four instructions in each fetch block are valid and none have R31 as a destination. 

(b) For convenience of implementation, the mapper requires that all virtual registers (architected and 
PAL shadow, excluding R31 and F31) be used as destinations before they are used as sources. In 
other words, the hardware does not create the "initial mapping" of virtual-to-physical registers; it 
relies on software. Since there is no hardware-created initial mapping, a virtual register cannot be 
used as a source operand before it is mapped. An example initial mapping sequence is as follows: 


ADDQ R31,R31,R0 
ADDQ R31,R31,R1 
ADDQ R31,R31,R2 
ADDQ R31,R31,R3 


ADDQ R31,R31,R4 
ADDQ R31,R31,R5 
ADDQ R31,R31,R6 
ADDQ R31,R31,R7 


ADDQ R31,R31,R28 
ADDQ R31,R31,R29 
ADDQ R31,R31,R30 
ADDQ R31,R31,R1 ; note that R31 need not be initialized as a destination 


ADDF F31,F31,F0 
ADDF F31,F31,F1 
ADDF F31,F31,F2 
ADDF F31,F31,F3 


ADDF F31,F31,F4 
ADDF F31,F31,F5 
ADDF F31,F31,F6 
ADDF F31,F31,F7 


ADDF F31,F31,F28 
ADDF F31,F31,F29 
ADDF F31,F31,F30 
ADDF F31,F31,F1 ; note that F31 need not be initialized as a destination 


Note that this sequence can be used to initialize the retirator status bits as well. 
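
The reset example in item (a) is sized so that exactly 20 fetch blocks, and therefore 80 instructions, pass through the machine before execution reaches `continue:`. The Python sketch below walks the loop to confirm that count. It is an illustration only: it assumes that `reset:` and `loop:` are each aligned so that the four instructions shown form one fetch block, and the function name is hypothetical.

```python
def count_reset_fetch_blocks(initial_count: int = 19) -> int:
    """Count the fetch blocks executed by the reset example before 'continue:'."""
    if initial_count < 1:
        raise ValueError("initial_count must be at least 1")
    r0 = initial_count   # ADDQ R31,#19,R0 (the next three ADDQs leave R0 unchanged)
    blocks = 1           # the four-instruction block at 'reset:'
    while True:
        r0 -= 1          # SUBQ R0,#1,R0
        blocks += 1      # one trip through the four-instruction 'loop:' block
        if r0 == 0:      # BNE R0,loop falls through once R0 reaches zero
            break
    return blocks
```

With the initial value of 19 used in the example, the function returns 20 blocks, which at four instructions per block matches the 80 retirator status bits.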


12.2 Restriction: No Multiple Writers to IPRs in Same Scoreboard Group 


Only one explicit writer (HW_MTPR) to IPRs that are in the same scoreboard group can appear in the same fetch 
block (octaword-aligned octaword). Multiple explicit writers to IPRs that are NOT in the same 
scoreboard group can appear. If this restriction is violated, the IPR readers might not see the in-order 
state, and the IPR might ultimately end up with a bad value. This is for convenience of 
implementation. 


12.3 Restriction: (removed) 


12.4 Restriction: No Writers and Readers to IPRs in Same Scoreboard 
Group 


Within one fetch block (octaword-aligned octaword), an implicit or explicit reader of an IPR in a 
particular scoreboard group cannot follow an explicit writer (HW_MTPR) to an IPR in that 
scoreboard group. This is for convenience of implementation. Note that implicit readers include all 
memory operations and JSR/HW_RET. 
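
Restrictions 12.2 and 12.4 are both per-fetch-block rules, so they lend themselves to a mechanical check over PAL code. The Python sketch below is one hypothetical way to do that. The instruction model (a kind plus a scoreboard group number) is an assumption made for illustration; in particular, the caller is assumed to have tagged implicit readers such as memory operations and JSR/HW_RET with the appropriate group, which this section does not enumerate.

```python
from typing import List, Optional, Tuple

# (kind, scoreboard group); kind is "write", "read" or "other".
Instr = Tuple[str, Optional[int]]


def check_fetch_blocks(instrs: List[Instr]) -> List[str]:
    """Report violations of 12.2 and 12.4; instrs[0] is assumed octaword aligned."""
    problems = []
    for base in range(0, len(instrs), 4):
        written = set()  # groups written so far in this fetch block
        for slot, (kind, group) in enumerate(instrs[base:base + 4]):
            if group is None:
                continue
            where = f"block {base // 4}, slot {slot}"
            if kind == "write":
                if group in written:
                    problems.append(f"{where}: second writer to group {group}")
                written.add(group)
            elif kind == "read" and group in written:
                problems.append(f"{where}: reader follows writer in group {group}")
    return problems
```

A writer in the last slot of one block followed by a reader in the first slot of the next block is not reported, because the restrictions apply only within a single fetch block.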


12.5 Restriction: PAL shadow enables 


Once PAL shadows are enabled (via I_CTL<SDE>), the NT-mode (I_CTL<NT_MODE>) state must 
not be changed. Enabling PAL shadows will allow the assignment of 8 physical registers to the 8 
additional general-purpose register specifiers as determined by I_CTL<NT_MODE>. Subsequent 
changing of I_CTL<NT_MODE> will assign 8 additional physical registers to the specifiers in the 
new overlay range but will not deallocate the prior 8 registers. The net effect is that 8 physical 
registers will be removed from the resource pool. 


12.6 Guideline: Avoid Consecutive read-modify-write-read-modify-write 
sequences to IPRs in the Same Scoreboard Group 


The latency between the first write and the second read is determined by the retire latency of the IPR. 
For convenience of implementation, the latency between when the read issues and the final write 
issues depends on the runtime contents of the issue queue. It is somewhere between 4 and 9 cycles 
even if there is no data dependency between the read and write. 


12.7 Restriction: Replay trap and interrupt code sequence and STF/ITOF 


On an MBOX replay trap, the EV6 Ibox guarantees that the refetched load or store that caused the 
trap will issue before any newer loads or stores. For loads and integer stores this is a consequence of 
the natural operation of the issue queue. The refetched instruction enters the age-prioritized queue 
ahead of newer loads and stores and will not have any dependencies on dirty registers. Since there is no 
time overhead for checking these register dependencies (i.e. it is known upon enqueueing that there 
are no dirty registers), the queue will issue it in priority order. For floating stores, there is normally 
some overhead associated with checking the floating point source register dirty status, so the store 
would normally wait before issuing. This would have the undesired consequence of allowing newer 
loads and stores to issue out-of-order. A deadlock can occur if this out-of-order issue causes the 
floating store to continually replay trap. To avoid the deadlock, on a floating store replay trap, the 
source register dirty status is not checked (the source register is assumed to be clean because the store 
was issued before). 

The hardware mechanism which keeps track of replayed floating stores and cancels the dirty register 
check requires some software restrictions to guarantee that it is applied appropriately to the replayed 
instruction and not to other floating stores. It operates by marking the position in the fetch block (low 
two bits of the PC) where the replay trap occurred and then canceling the floating point dirty source 
register check of the next valid instruction enqueued to the integer queue (integer instructions, all 
loads and stores, and ITOF) which has the same position in the fetch block (normally the replayed 
STF). If the PC is somehow diverted to a PAL flow, this hardware might inadvertently cancel the 
register check of some other STF (or ITOF) instruction. Fortunately, there are a minimal number of 
reasons why the PC might be diverted during a replay trap. They are: 

Interrupts 

ITB Fill 

(others?) 

In these PAL flows, a STF or ITOF instruction in a given position in a fetch block must be preceded 
by a valid instruction that is issued out of the integer queue in the same position in an earlier fetch 
block. Acceptable instruction classes include loads, integer stores, integer operates that do not have 
R31 as a destination, and branches. 

Example:

Bad_Interrupt_Flow_Entry:
    ADDQ R31,R31,R0
    STF  Fa,(Rb)       ; this STF might NOT undergo a dirty source register check
                       ; and might give wrong results
    ADDQ R31,R31,R0
    ADDQ R31,R31,R0

Good_Interrupt_Flow_Entry:
    ADDQ R31,R31,R0    ; enables FP dirty source register check for (PC<1:0> == 00)
    ADDQ R31,R31,R0    ; enables FP dirty source register check for (PC<1:0> == 01)
    ADDQ R31,R31,R0    ; enables FP dirty source register check for (PC<1:0> == 10)
    ADDQ R31,R31,R0    ; enables FP dirty source register check for (PC<1:0> == 11)

    ADDQ R31,R31,R0
    STF  Fa,(Rb)       ; this STF will successfully undergo a dirty source register check
    ADDQ R31,R31,R0
    ADDQ R31,R31,R0
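
The placement rule above can be checked mechanically. The following is a minimal Python sketch, not part of any EV6 tooling, that scans a PAL flow and reports every STF or ITOF whose fetch-block position has not yet been covered by an acceptable integer-queue instruction. It assumes the flow entry is aligned to fetch-block position 0 and that a fetch block holds four instructions; the class labels (LOAD, INT_STORE, INT_OP, BRANCH, STF, ITOF) and the function name are hypothetical.

```python
# Hypothetical instruction classes; each flow entry is (klass, dest_register_or_None).
FP_STORE_CLASSES = {"STF", "ITOF"}
ARMING_CLASSES = {"LOAD", "INT_STORE", "INT_OP", "BRANCH"}


def unsafe_fp_stores(flow, block_size=4):
    """Return indices of STF/ITOF entries that have no earlier acceptable
    integer-queue instruction at the same fetch-block position.
    Assumes flow[0] sits at fetch-block position 0."""
    armed = [False] * block_size
    unsafe = []
    for index, (klass, dest) in enumerate(flow):
        position = index % block_size  # models PC<1:0>
        if klass in FP_STORE_CLASSES and not armed[position]:
            unsafe.append(index)
        # Integer operates that write R31 do not count as acceptable.
        if klass in ARMING_CLASSES and not (klass == "INT_OP" and dest == "R31"):
            armed[position] = True
    return unsafe


addq = ("INT_OP", "R0")
stf = ("STF", None)
bad_flow = [addq, stf, addq, addq]
good_flow = [addq] * 4 + [addq, stf, addq, addq]
print(unsafe_fp_stores(bad_flow))   # [1]
print(unsafe_fp_stores(good_flow))  # []
```

The two lists mirror Bad_Interrupt_Flow_Entry and Good_Interrupt_Flow_Entry: in the bad flow the STF sits at position 01 before any instruction has covered that position.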

12.8 Restriction (removed)


12.9 Restriction: PALmode I-Stream address ranges 


• PALmode<physical> I-Stream addresses must ensure proper sign extension for the selected value of
I_CTL<VA_WIDE>. When I_CTL<VA_WIDE> is clear, indicating 43-bit virtual address format,
PALmode<physical> I-Stream addresses must sign extend address bits above bit 42 even though the
physical address range is 44 bits. An illegal address can only be generated by a PALmode JSR-type
instruction or a HW_RET instruction returning to a PALmode address.
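
As an illustration of the sign-extension requirement, the sketch below tests whether a 64-bit address is properly sign extended for a given virtual address width. It is a minimal Python model under stated assumptions: the function name and the `va_bits` parameter are hypothetical, and only the 43-bit case (I_CTL<VA_WIDE> clear) is taken from the text above.

```python
def is_sign_extended(addr, va_bits=43):
    """Return True if bits 63 down to (va_bits - 1) of a 64-bit address are
    all equal, i.e. the address is sign extended from bit (va_bits - 1)."""
    addr &= (1 << 64) - 1
    upper = addr >> (va_bits - 1)                  # bits 63 .. va_bits-1
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return upper == 0 or upper == all_ones


# Bit 43 set but bit 42 clear: a legal 44-bit physical address pattern that is
# NOT a legal PALmode I-Stream address in 43-bit format.
print(is_sign_extended(0x0000_0800_0000_0000))  # False
print(is_sign_extended(0xFFFF_FC00_0000_0000))  # True
```

A check like this could be applied to the targets of PALmode JSR-type instructions and HW_RET return addresses, since those are the only sources of an illegal address named above.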


12.10 Restriction: Duplicate IPR mode bits 


• The duplicate IPR mode bits must be equal when executing in native (virtual) mode:
I_CTL<VA_WIDE> must equal VA_CTL<VA_WIDE>, and I_CTL<NT_MODE> must equal
VA_CTL<NT_MODE>.


12.11 Guideline: Ibox IPR update synchronization


• When updating any Ibox IPR, a return to native (virtual) mode should use the HW_RET instruction
with the associated STALL bit set to ensure that the updated IPR value affects all instructions
following the return path. The new IPR value takes effect only after the associated HW_MTPR
instruction retires.


12.12 Restriction: HW_MFPR EXC_ADDR/IVA_FORM/EXC_SUM Usage


• These three registers are sourced by non-renamed hardware registers that need to be available for
subsequent traps. Hardware protects the values from overwrite by locking the registers, but only for a
limited time. Their values can only be read reliably by a HW_MFPR within the first four instructions
of a PALflow AND prior to any taken branch in that PALflow, whichever is earlier. After the
delimiting instruction defined above retires, these registers are unlocked and may change due to new
exception conditions.

• If a second exception occurs before the registers are unlocked, it will be either delayed or forced to
replay trap until the register has been unlocked. After being unlocked, a subsequent, new path
exception condition will be allowed to reload the register and trap to PAL. Note that the CPU may
complete execution of the first PAL flow, encountering the second exception condition before the
delimiting instruction retires, hence the need for the locking mechanism to ensure visibility of the
initial register value.
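
The read window described in this restriction can be expressed as a simple scan. The following Python sketch is a hypothetical lint, not a description of the hardware: each entry of the flow is a pair of booleans, `reads_locked_ipr` (the instruction is a HW_MFPR of one of the three registers) and `taken_branch`, and the function flags reads that fall outside the first four instructions or after the first taken branch.

```python
def unreliable_reads(flow, window=4):
    """flow: list of (reads_locked_ipr, taken_branch) booleans, in PAL flow
    order. Returns the indices of reads that occur after the registers may
    have been unlocked."""
    flagged = []
    unlocked = False
    for index, (reads_locked_ipr, taken_branch) in enumerate(flow):
        if index >= window:
            unlocked = True          # past the first four instructions
        if reads_locked_ipr and unlocked:
            flagged.append(index)
        if taken_branch:
            unlocked = True          # anything after a taken branch is too late
    return flagged


READ, BRANCH, OTHER = (True, False), (False, True), (False, False)
print(unreliable_reads([READ, BRANCH, READ]))                 # [2]
print(unreliable_reads([READ, OTHER, OTHER, READ, READ]))     # [4]
```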


12.13 Restriction: DTB FILL flow collision 


• Two DTB Fill flows might collide such that the HW_MTPRs in the second fill could issue before all
of the HW_MTPRs in the first flow have retired. This can be prevented by putting appropriate software
scoreboard barriers in the PAL flow.


12.14 Restriction: HW_RET 


• Do not place a HW_RET in the first fetch block of a PAL routine. The HW_RET will be mispredicted
and the JSR/RETURN stack might lose its synchronization.


12.15 Restriction: (REMOVED) 


12.16 Restriction: JSR-BAD VA 


• A JSR memory format instruction which generates a bad VA (IACV) trap requires PAL assistance to
determine the correct exception address. If the EXC_SUM<BAD_IVA> bit is set, bits <63:1> of the
exception address are valid in the VA IPR and not in EXC_ADDR as usual. The PALmode bit,
however, is always located in EXC_ADDR<0> and must be combined, if necessary, by PALcode to
determine the full exception address.
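
The combination step can be illustrated with a short Python sketch. This is a model of the arithmetic only, under the assumption that PALcode has already read the three values; the function and parameter names are hypothetical, and `bad_iva` stands for the state of EXC_SUM<BAD_IVA>.

```python
MASK64 = (1 << 64) - 1


def full_exception_address(exc_addr, va, bad_iva):
    """Combine VA<63:1> with the PALmode bit in EXC_ADDR<0> when BAD_IVA is
    set; otherwise EXC_ADDR already holds the full exception address."""
    if bad_iva:
        return ((va & ~1) | (exc_addr & 1)) & MASK64
    return exc_addr & MASK64


print(hex(full_exception_address(0x1001, 0x2000, True)))   # 0x2001
print(hex(full_exception_address(0x1001, 0x2000, False)))  # 0x1001
```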


12.17 Restriction: MTPR to DTB_TAG0/DTB_PTE0/DTB_TAG1/DTB_PTE1


• These four writes must be executed atomically, i.e. either all four must issue and retire or none of
them may issue and retire.


12.18 Restriction: No FP OPERATES or FP CONDITIONAL BRANCHES in 
same fetch block as MTPR 


• For convenience of implementation, no floating point operate instructions or FP conditional branches
are allowed in the same fetch block as any move-to-processor-register instruction. This includes
ADDF/MULF/DIVF/FBxx but does not include LDF/STF or ITOF/FTOI.
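
A fetch-block rule like this one lends itself to a simple lint pass over PALcode. The sketch below is a minimal Python illustration, assuming four-instruction fetch blocks with the flow starting on a block boundary; the class labels (MTPR, FP_OP, FP_BR, FP_LDST, INT) and the function name are hypothetical.

```python
def blocks_mixing_fp_and_mtpr(flow, block_size=4):
    """flow: list of instruction class labels in fetch order, starting on a
    fetch-block boundary. Returns the numbers of the fetch blocks that contain
    both an MTPR and an FP operate or FP conditional branch."""
    violations = []
    for start in range(0, len(flow), block_size):
        classes = set(flow[start:start + block_size])
        if "MTPR" in classes and classes & {"FP_OP", "FP_BR"}:
            violations.append(start // block_size)
    return violations


print(blocks_mixing_fp_and_mtpr(["MTPR", "INT", "FP_OP", "INT"]))       # [0]
print(blocks_mixing_fp_and_mtpr(["MTPR", "FP_LDST", "INT", "INT",
                                 "FP_OP", "FP_BR", "INT", "INT"]))      # []
```

FP loads and stores and ITOF/FTOI transfers are deliberately left out of the violating set, matching the exclusion stated above.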


12.19 Restriction: HW_RET/STALL after updating the FPCR via MT_FPCR 
in PALmode 


• FPCR updating happens in hardware based on the retirement of the nontrapping version of MT_FPCR
(in PALcode). Use a HW_RET/STALL after the nontrapping MT_FPCR to achieve minimum latency (4
cycles) between the retiring of the MT_FPCR and the first FLOP that uses the updated FPCR.


12.20 Guideline: I_CTL SBE Stream Buffer Enable


• The I_CTL<SBE> bits should not be enabled when running with the Icache disabled, to avoid
potentially long fill delays. When the Icache is disabled, the only method of supplying instructions is
via a stream hit. If the fill is returned in non-sequential wrap order, the stream will continue fetching
through the entire page while waiting for a hit. Normally the data will be found in the cache.


12.21 Restriction: HW_RET/STALL after MT ASN0/ASN1


• There must be a scoreboard bit -> register dependency chain to prevent MT ASN0 or MT ASN1 from
issuing while ANY of scoreboard bits <7:4> are set. A code sequence which accomplishes this:


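; assume R9 holds value to write to ASN0/ASN1
HW_MFPR IPR_VA,SCBD<7,6,5,4>,R0     ; R0 depends on scoreboard bits <7:4>
XOR R0,R0,R0                        ; R0 = 0, still dependent on the HW_MFPR
BIS R0,R9,R9                        ; R9 unchanged in value, now dependent on R0
BIS R31,R31,R31
HW_MTPR R9,ASN0,SCBD<4>
HW_MTPR R9,ASN1,SCBD<7>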


• This sequence guarantees, through the register dependency on R0, that neither HW_MTPR is issued
before scoreboard bits <7:4> are cleared. In addition, there must be a HW_RET/STALL after a MT
ASN0/MT ASN1 pair. Finally, these two writes must be executed atomically, i.e. either both must
issue and retire or neither may issue and retire.


12.22 Restriction: HW_RET/STALL after MT IS0/IS1


• There must be a scoreboard bit -> register dependency chain to prevent MT IS0 or MT IS1 from
issuing while ANY of scoreboard bits <7:4> are set. A code sequence which accomplishes this:


HW_MFPR IPR_VA,SCBD<7,6,5,4>,R0
XOR R0,R0,R0
BIS R0,R9,R9
BIS R31,R31,R31
HW_MTPR R9,IS0,SCBD<6>
HW_MTPR R9,IS1,SCBD<7>


• This sequence guarantees, through the register dependency on R0, that neither HW_MTPR is issued
before scoreboard bits <7:4> are cleared. In addition, there must be a HW_RET/STALL after a MT
IS0/MT IS1 pair. Also, these two writes must be executed atomically, i.e. either both must issue and
retire or neither may issue and retire.


12.23 Restriction: HW_ST/P/CONDITIONAL does not "clear" the lock flag 


• A HW_ST/P/CONDITIONAL will not "clear" the lock flag, so a successive store-conditional
(either STx_C or HW_ST/C) might succeed even in the absence of a load-locked instruction. In EV6
a store-conditional is forced to fail if there is an intervening memory operation between the
store-conditional and its address-matching LDx_L. The memory operations are:


LDL/Q/F/G/S/T 

STL/Q/F/G/S/T 

LDQ_U (not to R31) 

STQ_U 

Absent from this list are HW_LD (any type), HW_ST (any type), ECB, and WH64. Their absence implies 
that they will NOT force a subsequent store-conditional instruction to fail. PALcode MUST insert a 
memory operation from the above list after a HW_ST/CONDITIONAL in order to force a future store- 
conditional to fail if it was not preceded by a load-locked: 

HW_LD/L
XXX
HW_ST/C -> R0
Bxx R0, try_again
STQ                /* force next ST/C to fail if no preceding LDx_L */
HW_RET
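
The requirement to follow a HW_ST/CONDITIONAL with one of the listed memory operations can be checked with a linear scan. The following Python sketch is a hypothetical lint, not a model of the lock-flag hardware: it walks a PAL flow given as (mnemonic, destination register) pairs and reports each HW_ST/C that reaches a HW_RET, or the end of the flow, without an intervening operation from the list above. It ignores branch targets, which is a simplification.

```python
# LDL/Q/F/G/S/T, STL/Q/F/G/S/T, LDQ_U and STQ_U, as listed in the text.
FORCE_FAIL_OPS = {prefix + suffix for prefix in ("LD", "ST") for suffix in "LQFGST"}
FORCE_FAIL_OPS |= {"LDQ_U", "STQ_U"}


def unprotected_store_conditionals(flow):
    """flow: list of (mnemonic, dest_register_or_None) in program order.
    Returns indices of HW_ST/C entries not followed by a force-fail memory
    operation before the next HW_RET (or the end of the flow)."""
    unprotected = []
    pending = []
    for index, (mnemonic, dest) in enumerate(flow):
        if mnemonic == "HW_ST/C":
            pending.append(index)
        elif mnemonic in FORCE_FAIL_OPS and not (mnemonic == "LDQ_U" and dest == "R31"):
            pending.clear()              # this operation forces a later ST/C to fail
        elif mnemonic == "HW_RET":
            unprotected.extend(pending)
            pending.clear()
    unprotected.extend(pending)
    return unprotected


good = [("HW_LD/L", "R1"), ("HW_ST/C", "R0"), ("BEQ", None), ("STQ", None), ("HW_RET", None)]
bad = [("HW_LD/L", "R1"), ("HW_ST/C", "R0"), ("BEQ", None), ("HW_RET", None)]
print(unprotected_store_conditionals(good))  # []
print(unprotected_store_conditionals(bad))   # [1]
```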


12.24 Restriction: HW_RET/STALL after MT ITB_IA, ITB_IAP, IC_FLUSH 


• There must be a HW_RET/STALL after a MT ITB_IA, ITB_IAP or IC_FLUSH. The Icache flush
associated with these instructions will not occur until the HW_RET stall occurs and all outstanding
I-stream fetches have been completed.


12.25 Restriction: MT ITB_IA after Reset 


• An MT ITB_IA is required in the reset PALcode to initialize the ITB. It is also required that
PALcode not be exited, even via a mispredicted path, until this MT ITB_IA has retired. PALmode can
change temporarily after fetching a HW_RET, regardless of the STALL qualifier, down a
mispredicted path, leading to use of the ITB before it is actually initialized.

• Unexpected instruction fetch and execution can occur following misprediction of any memory format
Control instruction (JMP, JSR, RET, JSR_CO, or HW_JMP, HW_JSR, HW_RET, HW_JSR_CO
regardless of the STALL qualifier), or after any mispredicted conditional branch instruction. If the
unexpected instruction flow contains a HW_RET instruction, PALmode may be exited prematurely.

• One way to ensure that PALmode is not exited is to place the MT ITB_IA at least 80 instructions
before any possible HW_RET instruction can be encountered via any fetch path. Since memory
format Control instructions can mispredict to any cache location, they should also be avoided within
these 80 instructions.
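
The 80-instruction spacing can be verified over a straight-line listing of the reset PALcode. The Python sketch below is a hypothetical check under a strong simplifying assumption: it treats the flow as a linear list of mnemonics and does not follow real fetch paths, so it can only catch violations visible in program order. The function name and the "MT ITB_IA" label are illustrative.

```python
MEM_FORMAT_CONTROL = {"JMP", "JSR", "RET", "JSR_CO",
                      "HW_JMP", "HW_JSR", "HW_RET", "HW_JSR_CO"}


def control_too_close_to_itb_ia(flow, distance=80):
    """flow: list of mnemonics in program order. Returns indices of memory
    format control instructions (including HW_RET) that appear fewer than
    `distance` instructions after an MT ITB_IA."""
    flagged = []
    for index, mnemonic in enumerate(flow):
        if mnemonic != "MT ITB_IA":
            continue
        for later in range(index + 1, min(index + distance, len(flow))):
            if flow[later] in MEM_FORMAT_CONTROL:
                flagged.append(later)
    return flagged


ok_flow = ["MT ITB_IA"] + ["NOP"] * 79 + ["HW_RET"]      # HW_RET is 80 away
short_flow = ["MT ITB_IA"] + ["NOP"] * 78 + ["HW_RET"]   # HW_RET is 79 away
print(control_too_close_to_itb_ia(ok_flow))     # []
print(control_too_close_to_itb_ia(short_flow))  # [79]
```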


12.26 Guideline: Conditional branches in PALcode 


• To avoid pollution of the branch predictors and improve overall branch prediction accuracy,
conditional branch instructions in PALcode will be predicted not taken. The only exceptions to this
rule are conditional branches within the first cache fetch (up to four instructions) of all PAL flows
except call_pal flows. It is advisable that conditional branches be avoided in this window.


12.27 Restriction: Reset of 'Force-Fail Lock Flag' State in PALcode


• A virtual mode load or store is required in PALcode before the execution of any load-locked or
store-conditional instructions. The virtual-mode load or store may not be a HW_LD, HW_ST, LDx_L,
ECB, or WH64.